Abstract

Over the past few years, a family of interesting new inequalities for the entropies of sums and differences
of random variables has been developed by Ruzsa, Tao and others, motivated by analogous results in additive
combinatorics. The present work extends these earlier results to the case of random variables taking values in $\mathbb{R}^n$
or, more generally, in arbitrary locally compact and Polish abelian groups. We isolate and study a key quantity,
the Ruzsa divergence between two probability distributions, and we show that its properties can be used to extend
the earlier inequalities to the present general setting. The new results established include several variations on the
theme that the entropies of the sum and the difference of two independent random variables severely constrain
each other. Although the setting is quite general, the results are already of interest (and new) for random vectors in
$\mathbb{R}^n$. In that special case, quantitative bounds are provided for the stability of the equality conditions in the entropy
power inequality; a reverse entropy power inequality for log-concave random vectors is proved; an
information-theoretic analog of the Rogers-Shephard inequality for convex bodies is established; and it is observed that some
of these results lead to new inequalities for the determinants of positive-definite matrices. Moreover, by considering
the multiplicative subgroups of the complex plane, one obtains new inequalities for the differential entropies of
products and ratios of nonzero, complex-valued random variables.
I. Introduction

A. Motivation 

The properties of the entropy of sums and differences of random variables have attracted a great deal of interest
in almost every area of information theory. Classical results were primarily motivated by the study of additive 
noise channels, and in the past three decades connections with several other fields have emerged, including the 
foundations of probabilistic limit theorems, functional inequalities and probabilistic bounds. 

More recently, it was also observed that inequalities involving the entropies of sums and differences are closely 
tied to basic questions and results in the area of additive combinatorics, which in turn also have applications in 
communications. A prominent collection of tools in additive combinatorics are those provided by the
Plünnecke-Ruzsa sumset theory; see, e.g., [55] for a broad introduction. A simple example of such a result is the following.
Given two discrete sets $A$ and $B$, the sumset $A + B$ is defined as $A + B = \{a + b : a \in A,\ b \in B\}$, and the
difference set $A - B$ is $A - B = \{a - b : a \in A,\ b \in B\}$. The Ruzsa triangle inequality [4f ] states that, for any
three sets $A, B, C$, we have,

$$|A - C| \cdot |B| \le |A - B| \cdot |B - C|, \qquad (1)$$

where $|E|$ denotes the cardinality of a set $E$, and $A$, $B$ and $C$ are subsets of the integers, or any other discrete abelian
group. A fascinating connection between such inequalities and corresponding results for the Shannon entropy H 
was identified initially by Tao and Vu [56] and by Ruzsa [49], and it has been developed quite extensively by several 
authors over the past 10 years; see, e.g., [34], [54] and the references therein. The main idea is that, interpreting 
the entropy as the effective log-cardinality of the support of a random variable, then replacing the log-cardinality 
of every sumset (or difference set) by the entropy of a corresponding sum (respectively, difference) of independent 
discrete random variables, produces a candidate entropy inequality. For example, (1) becomes, 

$$H(X - Z) + H(Y) \le H(X - Y) + H(Y - Z), \qquad (2)$$


for independent X, Y, Z, where H denotes the Shannon entropy. 
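
As a small numerical illustration of the correspondence between (1) and (2), the following sketch checks both inequalities on the integers. The particular sets and probability mass functions are arbitrary choices made here for illustration and do not come from the cited works; the helper names `diff_set`, `entropy` and `diff_pmf` are likewise hypothetical.

```python
import math
from itertools import product

def diff_set(A, B):
    """Difference set A - B = {a - b : a in A, b in B}."""
    return {a - b for a, b in product(A, B)}

def entropy(p):
    """Shannon entropy (in nats) of a pmf given as a dict value -> probability."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def diff_pmf(p, q):
    """pmf of X - Y for independent X ~ p and Y ~ q on the integers."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x - y] = out.get(x - y, 0.0) + px * qy
    return out

# Illustrative sets for the Ruzsa triangle inequality (1).
A, B, C = {0, 1, 3}, {0, 2}, {1, 4, 5}
lhs_sets = len(diff_set(A, C)) * len(B)
rhs_sets = len(diff_set(A, B)) * len(diff_set(B, C))

# Illustrative pmfs supported on the same sets, for the entropy analog (2).
pX = {0: 0.5, 1: 0.3, 3: 0.2}
pY = {0: 0.6, 2: 0.4}
pZ = {1: 0.2, 4: 0.5, 5: 0.3}
lhs_ent = entropy(diff_pmf(pX, pZ)) + entropy(pY)
rhs_ent = entropy(diff_pmf(pX, pY)) + entropy(diff_pmf(pY, pZ))

print(lhs_sets <= rhs_sets, lhs_ent <= rhs_ent)
```

A single example of course proves nothing; it only makes the dictionary "log-cardinality of a difference set ↔ entropy of a difference" concrete.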

For discrete random variables, this connection was studied in detail by [34] and Tao [54], who established 
numerous such entropy inequalities. The main technical tool in Tao’s proofs was the submodularity property of the 
discrete entropy, which, as observed in our subsequent work [30], fails to hold in the case of differential entropy. 
Therefore, in order to extend Tao’s results to continuous random variables, new arguments were necessary, and the 
key property which replaced submodularity in the proofs of almost all of the corresponding differential entropy 
inequalities in [30], was the data processing inequality for mutual information. As for the results of [34], some 
of them can be extended without too much effort to continuous random variables, while others rely too much on 
the bijection-invariance of discrete entropy and cannot be extended both because of this and because of delicate 
measure-theoretic issues that only arise in the continuous case. 

The starting point of the present work is the desire to explore how this family of inequalities can be extended 
to random vectors in $\mathbb{R}^n$ and, more generally, to random variables taking values in general (locally compact, Polish)
abelian groups. Our main results, outlined below, include unified proofs for many of the earlier results in [54], [34] 
and all the results of [30]; a key ingredient in our approach is the identification of the Ruzsa divergence as the 
central quantity of interest. 

We note in passing that strong communication-theoretic motivation for the present work comes from the fact that 
our results can be used powerfully in the study of the degrees of freedom of interference channels (for which the 
computation of fundamental limits is a notoriously hard open problem). The results of our prior work [30] played 
a key role in the works of Wu, Shamai and Verdú [5' ] and Stotz and Bölcskei [5 ], [53]; we anticipate that the
more general results developed herein will also find applications to communication theory. 


B. Outline of main results 

The first contribution of this work is to isolate and study, in a general setting, a quantity that plays a key role 
in the behavior of entropy of sums and differences; we call this the Ruzsa divergence. Let X and Y denote two 
random variables which can be discrete, continuous, vector-valued, or, more generally, taking values in a locally
compact abelian group $G$. The Ruzsa divergence between $X$ and $Y$ is defined as,¹

$$d_R(X\|Y) := h(X' - Y') - h(X') = I(X' - Y'; Y'),$$

where X' and Y' are independent and have the same marginal distributions as X and Y, respectively, and h denotes 
the entropy on G. As described formally in the following section, h is the usual Shannon entropy if G is discrete, it 
is the (joint) differential entropy when $G = \mathbb{R}^n$, and in general it is the entropy defined with respect to Haar measure
on G. Much of the remainder of this section will summarize how the basic properties of the Ruzsa divergence can 
be used to provide unified proofs for all existing (discrete and continuous) entropy inequalities in this area, as well 
as their extensions to general groups, offering an analysis on spaces satisfying essentially minimal assumptions - 
specifically, on abelian groups equipped with the minimal topological structure necessary to guarantee the existence 
of a Haar measure so that a natural notion of entropy can be defined. 

The second contribution of this work is to highlight some interesting connections of the aforementioned techniques 
and ideas with problems related to the differential entropies of products of positive random variables, the entropy 
power inequality, results in convex geometry, and determinantal inequalities. 

We begin in Section II by introducing the main definitions and assumptions that will remain in effect throughout 
the paper. We first formally define the Ruzsa divergence $d_R(X\|Y)$, as well as two related quantities, the conditional
Ruzsa divergence and the Ruzsa difference. After some elementary observations, we then state in Theorem 1 the
triangle inequality for $d_R(X\|Y)$, which implies the inequality (2), and which is seen to be a simple consequence
of a stronger result, Theorem 2. This is stated and proved in Section III, where we also establish a number of
the important properties of $d_R(X\|Y)$. In Theorem 3 we show that it is subadditive with respect to convolution,
$d_R(X\|Y_1 + Y_2) \le d_R(X\|Y_1) + d_R(X\|Y_2)$, and in Theorem 5 we give a general information-theoretic version of
the Balog-Szemerédi-Gowers theorem, a significant inequality from additive combinatorics.

In Section IV we first re-interpret the subadditivity property of Theorem 3 in the context of important inequalities 
for the cardinalities of sumsets in additive combinatorics, called the Plünnecke-Ruzsa inequalities. Specifically, in
Theorem 6 we observe that, if $X, Y_1, Y_2, \ldots, Y_n$ are independent, then,

$$h\Big(X - \sum_{i=1}^{n} Y_i\Big) + (n-1)\,h(X) \le \sum_{i=1}^{n} h(X - Y_i).$$

We then examine the question of how different the entropies of X + X' and X — X' can be, when X and X' 
are independent and identically distributed (i.i.d.). As was pointed out by Lapidoth and Pete [31], the difference 
between the two can be arbitrarily large, which may be rephrased as saying that $d_R(X\|X')$ and $d_R(X\|-X')$ can
differ by an arbitrarily large amount. However, in Corollary 3 we show that the ratio between these two Ruzsa
divergences is always bounded between 1/2 and 2; this generalizes the doubling-difference inequality of [30]. In
Theorem 7 we give the general version of the sum-difference inequality [30], relating $h(X + X')$ and $h(X - X')$
[equivalently, relating $d_R(X\|X')$ and $d_R(X\|-X')$] when $X$ and $X'$ are independent but not necessarily identically
distributed. We close this section by giving general versions of some recent results by Wu, Shamai and Verdú [5 ]
on discrete random variables, which were used in a study of the degrees of freedom of the $M$-user interference
channel. In Lemma 6 and Theorem 8 we state and prove corresponding results for the entropy of weighted linear
combinations of random variables of the form $aX + bY$, where $X, Y$ take values in a general (locally compact and
Polish) abelian group, and $a, b$ are integers.
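
The doubling-difference statement above can be made concrete on the integers. The sketch below computes $d_R(X\|X') = h(X - X') - h(X)$ and $d_R(X\|-X') = h(X + X') - h(X)$ for i.i.d. $X, X'$ with an arbitrary three-point distribution, chosen here purely for illustration, and checks that their ratio lies between 1/2 and 2. The helper names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a pmf given as a dict value -> probability."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def combine(p, q, sign):
    """pmf of X + sign * Y for independent X ~ p and Y ~ q on the integers."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + sign * y] = out.get(x + sign * y, 0.0) + px * qy
    return out

# Illustrative pmf (an assumption of this sketch, not taken from the paper).
pX = {0: 0.5, 1: 0.3, 3: 0.2}

d_difference = entropy(combine(pX, pX, -1)) - entropy(pX)  # d_R(X || X')
d_sum = entropy(combine(pX, pX, +1)) - entropy(pX)         # d_R(X || -X')
ratio = d_difference / d_sum

print(d_difference, d_sum, ratio)
```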

In Section V, we consider the special cases of three subgroups $G$ of the complex plane $\mathbb{C}$, equipped with the
multiplication operation: the half-line $(0, \infty)$, the unit circle $\mathbb{T} \subset \mathbb{C}$, and the nonzero complex numbers $\mathbb{C} \setminus \{0\}$.
In each of these cases, the application of our general results leads to new inequalities for the differential entropies
of products and ratios of $G$-valued random variables.

¹ Although a symmetrical variant of this quantity, namely $\frac{1}{2}\big(d_R(X\|Y) + d_R(Y\|X)\big)$, has been called the “Ruzsa distance” and has been
studied before in the discrete setting by Tao [54] and for real-valued random variables in [30], we find that focusing on this non-symmetric
version makes various developments clearer. Furthermore, the Ruzsa divergence is a particular instance of the Kullback-Leibler divergence
or relative entropy, so that it inherits many of its characteristics, but it also has special properties that justify its close study.

In the last four sections we concentrate on the special case of real random vectors, taking $G = \mathbb{R}^n$ and $h$ to be
the usual (joint) differential entropy. In Section VI we look at the difference between $h(X + X')$ and $h(X - X')$
from a different perspective, and provide results in the spirit of the Freiman-Green-Ruzsa inverse sumset theorems.
In Corollary 7 we show (under certain conditions), based on a recent result from [7], that if h(X + X') — 2h(X) is 
small, then the distribution of X is necessarily close to being Gaussian, in a way that can be precisely quantified 
in terms of relative entropy. Then, in Theorem 10 we prove a converse result: If the two entropies h(X — X') and 
h(X + X') are significantly different, then the distribution of X will also be significantly different (in the relative 
entropy sense) from being Gaussian. These results can be seen as quantitative versions of the condition for equality 
in the entropy power inequality [50], [51]. Recall that, when applied to i.i.d. random vectors X,X', the entropy 
power inequality implies that, 

$$h(X + X') \ge h(X) + \frac{n}{2} \log 2,$$

where, throughout the paper, log denotes the natural logarithm $\log_e$, so that the entropy and all other familiar
information-theoretic quantities are expressed in nats. In Section VII we establish a reverse inequality of this sort: 
Corollary 8 states that, if X, X' are i.i.d. with a log-concave distribution, then, 

$$h(X + X') \le h(X) + n \log 2.$$
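
For a quick sanity check of the two bounds on $h(X + X')$, consider the one-dimensional example (chosen here for illustration, not taken from the paper) of $X, X'$ i.i.d. uniform on $[0, 1]$, which is log-concave with $h(X) = 0$. Then $X + X'$ has the triangular density $1 - |t - 1|$ on $[0, 2]$, whose differential entropy is $1/2$ nat. The sketch below evaluates this entropy by a plain midpoint rule; the function names and the step count are assumptions of the sketch.

```python
import math

def triangular_density(t):
    """Density of X + X' when X, X' are i.i.d. uniform on [0, 1]."""
    return max(0.0, 1.0 - abs(t - 1.0))

def differential_entropy(density, lo, hi, steps=100_000):
    """Midpoint-rule approximation of -integral of f log f over [lo, hi], in nats."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        f = density(lo + (k + 0.5) * width)
        if f > 0:
            total -= f * math.log(f) * width
    return total

n = 1          # dimension
h_x = 0.0      # entropy of the uniform distribution on [0, 1]
h_sum = differential_entropy(triangular_density, 0.0, 2.0)

lower = h_x + 0.5 * n * math.log(2)   # entropy power inequality bound
upper = h_x + n * math.log(2)         # reverse bound for log-concave X

print(lower, h_sum, upper)
```

In this example the sum entropy sits strictly between the two bounds, so neither is attained by the uniform distribution.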

In Section VIII we argue that the Ruzsa divergence is a natural analog of volume-based functionals that arise 
in the geometry of convex sets. In Corollary 9 we establish the following information-theoretic analog of the 
Rogers-Shephard inequality: If $X$ and $X'$ are i.i.d. with a log-concave distribution on $\mathbb{R}^n$, then,

$$h(X - X') \le h(X) + 2n \log 2.$$

In fact, we conjecture that the same result holds without the factor of 2 in the last term above. Finally, in Section IX, 
we briefly indicate how the earlier inequalities for the entropy can be used to develop corresponding inequalities 
for the determinants of positive-definite matrices. In particular, in Corollary 11 we establish the following variant 
of an inequality due to Rotfel’d [45]: If $K, K_1, K_2, \ldots, K_n$ are positive-definite matrices, then,

$$\det(K + K_1 + \cdots + K_n) \le [\det(K)]^{-(n-1)} \prod_{j=1}^{n} \det(K + K_j).$$
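
The determinantal inequality above is easy to probe numerically. The following sketch checks it for three hand-picked $2 \times 2$ positive-definite matrices, so that the number of summands is $n = 2$ and the exponent of $\det(K)$ is $-1$. The matrices and the helper names `det2` and `add` are illustrative assumptions, not objects from the paper.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def add(*matrices):
    """Entrywise sum of 2x2 matrices."""
    return [[sum(m[i][j] for m in matrices) for j in range(2)] for i in range(2)]

# Symmetric positive-definite matrices (positive diagonal and determinant).
K = [[2.0, 0.5], [0.5, 1.0]]
K1 = [[1.0, -0.3], [-0.3, 0.5]]
K2 = [[0.4, 0.1], [0.1, 3.0]]

n = 2  # number of matrices K_1, ..., K_n
lhs = det2(add(K, K1, K2))
rhs = det2(K) ** (-(n - 1)) * det2(add(K, K1)) * det2(add(K, K2))

print(lhs, rhs, lhs <= rhs)
```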
II. The Ruzsa divergence

We begin by introducing the basic definitions of Haar measure and random variables with values in an abelian 
group. Readers not interested in the general formulation can simply skip to the two main examples below, and read 
the rest of this paper keeping only these two key examples in mind. 

Let $G$ be an abelian topological group, i.e., a topological space endowed with a commutative, associative and
continuous operation (i.e., a continuous function from $G \times G$ to $G$ that takes $(x, y)$ to an element of $G$ denoted
$x + y$), which has an identity element $0$ (such that $x + 0 = x$ for all $x$ in $G$) and with every element having an
inverse (i.e., for each $x \in G$ there is an element in $G$ denoted $-x$ such that $x + (-x) = 0$). We will always assume
that the topology on $G$ is Polish (i.e., it is metrizable so that the resulting metric space is complete and separable),
and locally compact (i.e., every point has a compact neighborhood). The Borel $\sigma$-algebra $\mathcal{G}$ on $G$ is the $\sigma$-algebra
generated by all open sets. It is a classical fact (see, e.g., [25], [40], [24]) that under these assumptions, there exists
a (countably additive) measure $\lambda$ defined on $\mathcal{G}$ that is translation-invariant, i.e., such that $\lambda(A + x) = \lambda(A)$ for
each $A \in \mathcal{G}$ and each $x \in G$, where $A + x = \{a + x : a \in A\}$. Such a measure is called a Haar measure, and it
is unique up to scaling by a positive constant. In any given situation, we will assume that the scaling is chosen at 
the beginning and fixed; thus we will talk without further comment about “the” Haar measure on G. 

For our analysis, the normalization (particular scaling chosen) of the Haar measure does not matter. Nonetheless, 
it is useful to keep in mind the common normalizations used for the most important examples - namely discrete 
groups and the additive group $\mathbb{R}^n$. When $G$ is a countable group with the discrete topology, we will always take
the Haar measure $\lambda$ to be counting measure, i.e., $\lambda(\{g\}) = 1$ for every element $g \in G$, and define $\lambda$ on any
subset of $G$ as its (possibly infinite) cardinality. When $G$ is not compact, the Haar measure is infinite, and then
it is common to fix the normalization by fixing the measure of some special set; in the case of $\mathbb{R}^n$, as usual, by
requiring $\lambda([0, 1]^n) = 1$, we obtain the Lebesgue measure.

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and $X$ be a $G$-valued random variable on it (i.e., a function from $\Omega$ to $G$
measurable with respect to $\mathcal{F}$ and $\mathcal{G}$). We say that the random variable $X$ taking values in $G$ has a continuous
distribution if its probability distribution, namely the image measure $P_X$ induced by the mapping $X$ on $G$, is
absolutely continuous with respect to the Haar measure $\lambda$. In this case, denoting the Radon-Nikodym derivative
$\frac{dP_X}{d\lambda}(x)$ by $f(x) = f_X(x)$, we say that $X$ has density $f$, and write $X \sim f$.

Example 1. When G is countable, every G-valued random variable X has a continuous distribution, and its density 
is simply the probability mass function of $X$, i.e., $f_X(x) = \mathbb{P}\{X = x\}$.

Example 2. Let $G$ be the set $\mathbb{R}^n$ equipped with the addition operation, so that $\lambda$ is the usual Lebesgue measure. Let
$X$ be a $G$-valued random variable. If $X$ is a continuous random variable, then its density is the usual probability
density function $f_X : G \to [0, \infty)$ of the random vector $X$ with respect to Lebesgue measure, satisfying,

$$\mathbb{P}\{X \in B\} = \int_B f_X(x)\, dx,$$

for each $B \in \mathcal{G}$, where $\mathcal{G}$ is the collection of Borel subsets of $\mathbb{R}^n$.

If $X$ has density $f$ on the group $G$, the entropy of $X$ is defined by,

$$h(X) = -\int_G f(x) \log f(x)\, dx,$$

provided that the integral exists in the Lebesgue sense. As usual, we write $h(X)$ even though the entropy depends
only on the density $f$ of $X$. Clearly, $h$ is precisely the discrete entropy in the setting of Example 1, and the
differential entropy in the setting of Example 2. 

To summarize, we assume throughout that $G$ is a Polish, locally compact, abelian group, equipped with the
Haar measure $\lambda$ on its Borel $\sigma$-field $\mathcal{G}$. Then it is easy to check that the same properties are satisfied by the
Cartesian product $G^n$ (with coordinate-wise addition defining the group structure, the product topology defining
the topological structure, and the product measure $\lambda^n$ being its Haar measure), for any $n \in \mathbb{N}$. Thus we can define
the entropy of any finite collection of jointly distributed random variables $(X_1, \ldots, X_n)$, each with values in $G$,
simply by treating $(X_1, \ldots, X_n)$ as a measurable function from $\Omega$ to the Cartesian product $G^n$, and computing the
entropy of its density. [Generally we will not use the common term “joint entropy,” since we prefer to think of
the collection of random variables as a single random object.] In particular, we can define the conditional entropy
between two $G$-valued random elements $X$ and $Y$ by the usual chain rule, as $h(Y|X) = h(X, Y) - h(X)$.

SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, 2015

Although particular care is needed to see which of the standard properties of discrete entropy and differential 
entropy carry over to the general case, we note that it is immediate from the definition that some key properties 
remain true. First, the entropy is always translation-invariant in that, for any constant $a \in G$, $h(X + a) = h(X)$,
because of the translation-invariance of the Haar measure. Also, the chain rule holds in general, and, if we define 
the mutual information as usual as a difference of entropies, the chain rule for mutual information also holds in this 
general setting. Finally, the property which will play the most central role in our subsequent development, namely 
the data processing inequality for mutual information, also holds in complete generality. 

Definition 1. Suppose X and Y are G-valued random variables with finite entropy. The Ruzsa divergence between 
X and Y is defined as, 

$$d_R(X\|Y) := h(X' - Y') - h(X'),$$

where X' and Y' are taken to be independent random variables with the same distributions as X and Y, respectively. 

Let us note that even though the entropies of $X$ and $Y$ above are assumed to be finite, it is possible that $h(X' - Y')$
and hence $d_R(X\|Y)$ are $+\infty$ (see, e.g., [12] for examples). In order to avoid uninteresting technicalities, in the
statements of all subsequent definitions and results, we will always implicitly assume that the entropies and Ruzsa 
divergences that appear are well-defined and finite. The adjustments that need to be made to address possible 
infinities are left to the reader; see, e.g., the discussion after Lemma 7 where we work out explicitly the precise 
finiteness conditions in one particular case. 

A more precise way of writing the Ruzsa divergence would have been to write it as $d_R(f_1\|f_2)$, where $X \sim f_1$
and $Y \sim f_2$, but we find it convenient to highlight the random vectors in the notation. The term “divergence” is
designed to invoke comparison with the relative entropy or Kullback-Leibler divergence (in that $d_R$ also satisfies
some properties of a distance but not others, e.g., it is not symmetric); in fact, it is immediately obvious that the 
Ruzsa divergence is just a special case of the mutual information (and hence of the relative entropy). 

Lemma 1. For any two G-valued random variables X, Y, 

$$d_R(X\|Y) = I(X' - Y'; Y'),$$

where $I(Z; W) = h(Z) + h(W) - h(Z, W)$ denotes the mutual information between $Z$ and $W$, and $X' \sim X$ and
$Y' \sim Y$ are independent. In particular, $d_R(X\|Y) \ge 0$.

Observe that $d_R(X\|X) = I(X - X'; X)$, where $X'$ is an independent copy of $X$, and this is rarely identically
zero. In particular, when $G = \mathbb{R}^n$, $d_R(X\|X)$ is never zero, since the entropy power inequality implies a strictly
positive lower bound on $d_R(X\|X)$ depending only on $n$, as discussed in Section VI. Thus even if we ignore the
asymmetry of Ruzsa divergence (which can be fixed by averaging $d_R(X\|Y)$ and $d_R(Y\|X)$), one should be careful
in interpreting it as a notion of distance.
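
The caveats above can be seen in a small discrete example on the integers (with counting measure as Haar measure). The sketch below, whose distributions and function names are illustrative assumptions, computes $d_R$ directly from its definition, compares it with the mutual-information form of Lemma 1, and exhibits both the asymmetry and the fact that $d_R(X\|X) > 0$.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a pmf given as a dict value -> probability."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def diff_pmf(p, q):
    """pmf of X' - Y' for independent X' ~ p and Y' ~ q on the integers."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x - y] = out.get(x - y, 0.0) + px * qy
    return out

def ruzsa_divergence(p, q):
    """d_R(X||Y) = h(X' - Y') - h(X') for independent X' ~ p, Y' ~ q."""
    return entropy(diff_pmf(p, q)) - entropy(p)

def mutual_information_form(p, q):
    """I(X' - Y'; Y') computed from the joint pmf of (X' - Y', Y')."""
    joint = {}
    for x, px in p.items():
        for y, qy in q.items():
            joint[(x - y, y)] = joint.get((x - y, y), 0.0) + px * qy
    return entropy(diff_pmf(p, q)) + entropy(q) - entropy(joint)

pX = {0: 0.7, 1: 0.2, 2: 0.1}
pY = {0: 0.5, 5: 0.5}

d_xy = ruzsa_divergence(pX, pY)
d_yx = ruzsa_divergence(pY, pX)
d_xx = ruzsa_divergence(pX, pX)

print(d_xy, mutual_information_form(pX, pY), d_yx, d_xx)
```

Here the two supports are far enough apart that $X' - Y'$ determines both variables, so $d_R(X\|Y) = h(Y)$ while $d_R(Y\|X) = h(X)$.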

However, the quantity d R satisfies a triangle inequality. 

Theorem 1 (Triangle inequality for Ruzsa divergence). If $X_1, X_2, X_3$ are independent, then,

$$d_R(X_1\|X_3) \le d_R(X_1\|X_2) + d_R(X_2\|X_3).$$

Theorem 1 was proved originally (in an equivalent form) for discrete random variables by Ruzsa [49]; see also 
Tao [54]. Since the discrete arguments used in these proofs rely on the property of submodularity which fails in the 
continuous setting, a different proof for Theorem 1 was recently provided in [30] for real-valued random variables. 
The proof we present for the general setting in Section III uses both a re-interpretation of the approach used in 
[30], and a sufficient condition for bijections in locally compact abelian groups to preserve the entropy, recently 
obtained in [35] and stated in Lemma 5. 
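
In the continuous setting, Theorem 1 can be checked in closed form for scalar Gaussians, which is an illustrative special case chosen here rather than an example from the paper. If $X \sim N(0, v_x)$ and $Y \sim N(0, v_y)$ are independent, then $h(X) = \frac{1}{2}\log(2\pi e\, v_x)$ and $X - Y \sim N(0, v_x + v_y)$, so $d_R(X\|Y) = \frac{1}{2}\log(1 + v_y / v_x)$. The sketch below scans a small grid of variances; the names are hypothetical.

```python
import math
from itertools import product

def gaussian_entropy(var):
    """Differential entropy (in nats) of a scalar N(0, var) random variable."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def ruzsa_divergence_gaussian(var_x, var_y):
    """d_R(X||Y) for independent X ~ N(0, var_x) and Y ~ N(0, var_y)."""
    return gaussian_entropy(var_x + var_y) - gaussian_entropy(var_x)

variances = [0.1, 0.5, 1.0, 2.0, 10.0]

# Smallest slack in d_R(X1||X2) + d_R(X2||X3) - d_R(X1||X3) over the grid.
worst_slack = min(
    ruzsa_divergence_gaussian(v1, v2)
    + ruzsa_divergence_gaussian(v2, v3)
    - ruzsa_divergence_gaussian(v1, v3)
    for v1, v2, v3 in product(variances, repeat=3)
)

print(worst_slack)
```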

We now define a conditional version of the Ruzsa divergence. Throughout this paper, we say that $X \leftrightarrow Z \leftrightarrow Y$
form a Markov chain if they are defined on a common probability space and the conditional distribution of $X$ given
$(Z, Y)$ is the same as that of $X$ given $Z$ alone. The assertion that $X \leftrightarrow Z \leftrightarrow Y$ form a Markov chain is easily
seen to be symmetric, i.e., it is equivalent to the statement that $Y \leftrightarrow Z \leftrightarrow X$ form a Markov chain.




Definition 2. Suppose $X_1$, $Y$, and $X_2$ are $G$-valued random variables, such that $X_1 \leftrightarrow Y \leftrightarrow X_2$ forms a Markov
chain. The conditional Ruzsa divergence between $X_1$ and $X_2$ given $Y$ is,

$$d_R(X_1\|X_2|Y) := h(X_1 - X_2|Y) - h(X_1|Y).$$

Lemma 2. If $X_1 \leftrightarrow Y \leftrightarrow X_2$ form a Markov chain, then,

$$d_R(X_1\|X_2|Y) = I(X_1 - X_2; X_2|Y),$$

where $I(Z; W|V) = h(Z|V) + h(W|V) - h(Z, W|V)$ denotes the conditional mutual information between $Z$ and
$W$, given $V$. In particular, $d_R(X_1\|X_2|Y) \ge 0$.

Proof: Observe that,

$$\begin{aligned}
d_R(X_1\|X_2|Y) &= h(X_1 - X_2|Y) - h(X_1|Y) \\
&= h(X_1 - X_2|Y) - h(X_1|Y, X_2) \\
&= h(X_1 - X_2|Y) - h(X_1 - X_2|Y, X_2) \\
&= I(X_1 - X_2; X_2|Y) \\
&\ge 0.
\end{aligned}$$

The Markov condition was used in an essential way in the second equality of the above display, while the
translation-invariance of entropy was used in the third equality. ■

Observe that $d_R(X_1\|X_2|Y) \ne d_R(X_2\|X_1|Y)$ in general, but that both quantities are non-negative.

Finally we introduce a more general version of the Ruzsa divergence, involving dependent random variables. 

Definition 3. The Ruzsa difference of the two $G$-valued random variables $X$ and $Y$ is,

$$\tilde{d}_R(X\|Y) := h(X - Y) - h(X).$$

Clearly, $\tilde{d}_R(X\|Y) = d_R(X\|Y)$ when $X$ and $Y$ are independent, but in general $\tilde{d}_R(X\|Y)$ is not a divergence
and need not be non-negative. Indeed, it is easy to see that one always has the following identity.

Lemma 3. For any pair of $X, Y$,

$$\tilde{d}_R(X\|Y) = I(X - Y; Y) - I(X; Y).$$
III. Properties of Ruzsa divergence

A special case of the Markov chain condition $X_1 \leftrightarrow Y \leftrightarrow X_2$ is when $X_1$ is independent of $(Y, X_2)$. Then, the
conditional Ruzsa divergence can be related to the (unconditional) Ruzsa divergence.

Lemma 4. (CONDITIONING REDUCES RUZSA DIVERGENCE) If $X_1$ is independent of $(Y, X_2)$, then,

$$d_R(X_1\|X_2) = d_R(X_1\|X_2|Y) + I(Y; X_1 - X_2),$$

and, in particular, $d_R(X_1\|X_2|Y) \le d_R(X_1\|X_2)$.

Proof: By Lemma 2 and the chain rule for mutual information,

$$\begin{aligned}
d_R(X_1\|X_2|Y) &= I(X_1 - X_2; X_2|Y) \\
&= I(X_1 - X_2; (X_2, Y)) - I(X_1 - X_2; Y).
\end{aligned}$$

But,

$$\begin{aligned}
I(X_1 - X_2; (X_2, Y)) &= h(X_1 - X_2) - h(X_1 - X_2|X_2, Y) \\
&= h(X_1 - X_2) - h(X_1|X_2, Y),
\end{aligned}$$

by translation-invariance of entropy. The assumed independence now implies that,

$$\begin{aligned}
I(X_1 - X_2; (X_2, Y)) &= h(X_1 - X_2) - h(X_1) \\
&= d_R(X_1\|X_2),
\end{aligned}$$

so that,

$$\begin{aligned}
d_R(X_1\|X_2|Y) &= d_R(X_1\|X_2) - I(Y; X_1 - X_2) \\
&\le d_R(X_1\|X_2).
\end{aligned}$$
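
To see Lemma 4 at work in the discrete case, the sketch below takes $X_1$ independent of a dependent pair $(Y, X_2)$ on the integers and evaluates the three quantities in the lemma from the relevant joint distributions. The probability tables and variable names are illustrative assumptions. Since $X_1$ is independent of $Y$, the code uses $h(X_1|Y) = h(X_1)$.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a pmf given as a dict value -> probability."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

p1 = {0: 0.6, 1: 0.4}                                           # pmf of X1
p_yx2 = {(0, 0): 0.4, (0, 1): 0.1, (1, 1): 0.2, (1, 2): 0.3}     # joint pmf of (Y, X2)

# Joint pmf of (X1 - X2, Y) and the marginals of X1 - X2 and of Y.
joint_dy, p_d, p_y = {}, {}, {}
for x1, a in p1.items():
    for (y, x2), b in p_yx2.items():
        d = x1 - x2
        joint_dy[(d, y)] = joint_dy.get((d, y), 0.0) + a * b
        p_d[d] = p_d.get(d, 0.0) + a * b
for (y, _), b in p_yx2.items():
    p_y[y] = p_y.get(y, 0.0) + b

h_d_given_y = entropy(joint_dy) - entropy(p_y)       # h(X1 - X2 | Y)
d_cond = h_d_given_y - entropy(p1)                   # d_R(X1 || X2 | Y)
d_uncond = entropy(p_d) - entropy(p1)                # d_R(X1 || X2)
info = entropy(p_d) + entropy(p_y) - entropy(joint_dy)  # I(Y; X1 - X2)

print(d_cond, d_uncond, info)
```

In this computation the identity $d_R(X_1\|X_2) = d_R(X_1\|X_2|Y) + I(Y; X_1 - X_2)$ holds by construction, so the informative checks are that the conditional divergence is non-negative and does not exceed the unconditional one.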


To motivate the next property of Ruzsa divergence we will develop, it is useful to consider the special case
$G = \mathbb{R}^n$, equipped with Lebesgue measure. In this case, it is an elementary fact that for any matrix $A \in GL_n(\mathbb{R})$
(i.e., any invertible $n \times n$ matrix), and for any random vector $X$ taking values in $\mathbb{R}^n$, $h(AX) = h(X) + \log|\det(A)|$,
where $\det(\cdot)$ denotes the determinant. This has two useful consequences. Firstly, $d_R(X\|AY) = d_R(A^{-1}X\|Y)$ so
that in particular, $d_R(X\|-Y) = d_R(-X\|Y)$. Secondly, for any matrix $A \in SL_n(\mathbb{R})$ (i.e., any invertible matrix
with determinant 1), entropy is preserved by the corresponding linear transformation, i.e., $h(AX) = h(X)$.

For a general locally compact abelian group $G$, the notion of a linear transformation on $G^n$ defined by a matrix
$A$ no longer makes sense. However, when the elements of an $n \times n$ matrix $A = (a_{ij})_{1 \le i, j \le n}$ are integers, we can
talk about the group homomorphism induced by $A$ on $G^n$. Specifically, for $x = (x_1, \ldots, x_n) \in G^n$, we denote by
$Ax$ the element,

$$Ax = \Big(\sum_{j=1}^{n} a_{1j} x_j, \ldots, \sum_{j=1}^{n} a_{nj} x_j\Big) \in G^n,$$

where $ax$ denotes the element $x + \cdots + x \in G$, added $a$ times. Even though $G^n$ is not a linear space, we will
sometimes call an integer matrix $A$ a “linear transformation,” with the understanding that this refers to the group
homomorphism induced by it as above.

The general linear group over the integer ring $\mathbb{Z}$ (strictly speaking, of the module $\mathbb{Z}^n$), denoted $GL_n(\mathbb{Z})$, is the
set of all $n \times n$ matrices with integer entries and determinant $+1$ or $-1$. The following result was recently shown
in [35].

Lemma 5. Let $X$ be a random variable taking values in $G^n$. If $A \in GL_n(\mathbb{Z})$, then,

$$h(AX) = h(X).$$
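
Lemma 5 can be illustrated on a finite group, where a matrix in $GL_2(\mathbb{Z})$ simply permutes the points of $G^2$. The sketch below takes $G = \mathbb{Z}_4$ (an assumption made for this illustration), pushes a joint pmf on $G^2$ forward through an integer matrix, and compares entropies for a unimodular matrix and for a matrix of determinant 2. The pmf and the helper names are hypothetical.

```python
import math

M = 4  # work in the cyclic group Z_4, with counting measure as Haar measure

def entropy(p):
    """Shannon entropy (in nats) of a pmf given as a dict value -> probability."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def push_forward(p, A):
    """pmf of AX for X ~ p on Z_M x Z_M and a 2x2 integer matrix A."""
    out = {}
    for (x1, x2), w in p.items():
        y = ((A[0][0] * x1 + A[0][1] * x2) % M,
             (A[1][0] * x1 + A[1][1] * x2) % M)
        out[y] = out.get(y, 0.0) + w
    return out

p = {(0, 0): 0.4, (1, 3): 0.3, (2, 1): 0.2, (3, 3): 0.1}

unimodular = [[1, 0], [1, -1]]  # determinant -1, so it lies in GL_2(Z)
doubling = [[2, 0], [0, 1]]     # determinant 2, so it is not in GL_2(Z)

h_original = entropy(p)
h_unimodular = entropy(push_forward(p, unimodular))
h_doubling = entropy(push_forward(p, doubling))

print(h_original, h_unimodular, h_doubling)
```

The doubling map merges the points $(1, 3)$ and $(3, 3)$ of $\mathbb{Z}_4^2$, which is why the entropy drops in that case; the determinant condition is what rules out such collapses.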

This allows us to extend the observation that the Ruzsa divergence behaves nicely when the random vectors 
involved are linearly transformed. 
Corollary 1. For any $A \in GL_n(\mathbb{Z})$, and any pair of $G$-valued random variables $X, Y$,

$$d_R(X\|AY) = d_R(A^{-1}X\|Y).$$

In particular, $d_R(X\|-Y) = d_R(-X\|Y)$.

Proof: Assume, without loss of generality, that $X, Y$ are independent. By Lemma 5,

$$\begin{aligned}
d_R(X\|AY) &= h(X - AY) - h(X) \\
&= h(A^{-1}X - Y) - h(A^{-1}X) \\
&= d_R(A^{-1}X\|Y).
\end{aligned}$$


We now prove a sharpened version of the triangle inequality in Theorem 1. 

Theorem 2. If $X_1, X_2, X_3$ are independent, then,

$$d_R(X_1\|X_3) \le d_R(X_1\|X_2|X_2 - X_3) + d_R(X_2\|X_3).$$

Proof: By an application of Lemma 1 and the data processing inequality for mutual information,

$$d_R(X_1\|X_3) = I(X_1 - X_3; X_3) \le I((X_1 - X_2, X_2 - X_3); X_3).$$

By the chain rule for mutual information, however,

$$\begin{aligned}
I((X_1 - X_2, X_2 - X_3); X_3) &= I(X_2 - X_3; X_3) + I(X_1 - X_2; X_3|X_2 - X_3) \\
&= d_R(X_2\|X_3) + I(X_1 - X_2; X_3|X_2 - X_3),
\end{aligned}$$

where we used Lemma 1 in the last equality. All that remains is to show that,

$$d_R(X_1\|X_2|X_2 - X_3) = I(X_1 - X_2; X_3|X_2 - X_3),$$

or, in view of Lemma 2, that,

$$I(X_1 - X_2; X_2|X_2 - X_3) = I(X_1 - X_2; X_3|X_2 - X_3). \qquad (3)$$

Let us observe the following general fact:

$$I(X; Y, Y - Z) = I(X; Y, Z). \qquad (4)$$

To see this, write,

$$\begin{aligned}
I(X; Y, Y - Z) &= h(Y, Y - Z) - h(Y, Y - Z|X) \\
&= h(Y, Z) - h(Y, Z|X) \\
&= I(X; Y, Z),
\end{aligned}$$

where the second identity relied on Lemma 5, and the fact that the mapping of $(y, z)$ to $(y, y - z)$ is represented
by the $2 \times 2$ matrix,

$$\begin{pmatrix} 1 & 0 \\ 1 & -1 \end{pmatrix},$$

which has determinant $-1$ and therefore belongs to $GL_2(\mathbb{Z})$.
Then (4) implies that,

$$I(X_1 - X_2; X_2, X_2 - X_3) = I(X_1 - X_2; X_3, X_2 - X_3),$$

since both these quantities equal $I(X_1 - X_2; X_2, X_3)$. Subtracting $I(X_1 - X_2; X_2 - X_3)$ from both sides, we obtain
the identity (3), completing the proof. ■
Remark 1. Using Lemma 4, Theorem 2 can be written in a symmetric form as,

$$d_R(X_1\|X_3) \le d_R(X_1\|X_2) + d_R(X_2\|X_3) - I(X_1 - X_2; X_2 - X_3),$$

and Theorem 1 immediately follows.
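
The size of the correction term in Remark 1 can be examined numerically. The following sketch, with arbitrarily chosen distributions on the integers and hypothetical helper names, evaluates both sides of the symmetric form for independent $X_1, X_2, X_3$, computing $I(X_1 - X_2; X_2 - X_3)$ from the joint pmf of the two differences.

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy (in nats) of a pmf given as a dict value -> probability."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def diff_pmf(p, q):
    """pmf of X - Y for independent X ~ p and Y ~ q on the integers."""
    out = {}
    for (x, px), (y, qy) in product(p.items(), q.items()):
        out[x - y] = out.get(x - y, 0.0) + px * qy
    return out

def ruzsa_divergence(p, q):
    """d_R(X||Y) = h(X - Y) - h(X) for independent X ~ p, Y ~ q."""
    return entropy(diff_pmf(p, q)) - entropy(p)

def info_between_differences(p1, p2, p3):
    """I(X1 - X2; X2 - X3) for independent X1 ~ p1, X2 ~ p2, X3 ~ p3."""
    joint = {}
    for (x1, a), (x2, b), (x3, c) in product(p1.items(), p2.items(), p3.items()):
        key = (x1 - x2, x2 - x3)
        joint[key] = joint.get(key, 0.0) + a * b * c
    return entropy(diff_pmf(p1, p2)) + entropy(diff_pmf(p2, p3)) - entropy(joint)

p1 = {0: 0.5, 1: 0.5}
p2 = {0: 0.3, 1: 0.7}
p3 = {0: 0.6, 2: 0.4}

lhs = ruzsa_divergence(p1, p3)
correction = info_between_differences(p1, p2, p3)
rhs_sharp = ruzsa_divergence(p1, p2) + ruzsa_divergence(p2, p3) - correction

print(lhs, rhs_sharp, correction)
```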


A useful property of Ruzsa divergence is subadditivity in the second argument, which may be equivalently
expressed as a monotonicity property in the first argument.

Theorem 3. If $X$, $Y_1$ and $Y_2$ are independent, then,

$$d_R(X\|Y_1 + Y_2) \le d_R(X\|Y_1) + d_R(X\|Y_2).$$

Equivalently, if $X_1$, $X_2$ and $Y$ are independent, then,

$$d_R(X_1 + X_2\|Y) \le d_R(X_1\|Y).$$


Proof: Observe that,

$$\begin{aligned}
d_R(X\|Y_1 + Y_2) - d_R(X\|Y_1) &= h(X - Y_1 - Y_2) - h(X) - [h(X - Y_1) - h(X)] \\
&= h(X - Y_1 - Y_2) - h(X - Y_1) \\
&= d_R(X - Y_1\|Y_2).
\end{aligned}$$

By relabeling variables, we see that the two formulations are equivalent.

To prove the second formulation (and hence also the first), note that by Lemma 1, and the data processing
inequality and the chain rule for mutual information,

$$\begin{aligned}
d_R(X_1 + X_2\|Y) &= I(X_1 + X_2 - Y; Y) \\
&\le I(X_1 - Y, X_2; Y) \\
&= I(X_1 - Y; Y) + I(X_2; Y|X_1 - Y).
\end{aligned}$$

The second term in the last line is 0 since $X_2$ is independent of $(X_1, Y)$, so that another application of Lemma 1
gives the desired result. ■
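
Both formulations of Theorem 3 can be tested on small discrete examples. The sketch below, whose distributions and helper names are illustrative assumptions, checks subadditivity in the second argument and monotonicity in the first argument for independent integer-valued random variables.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a pmf given as a dict value -> probability."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def combine(p, q, sign):
    """pmf of X + sign * Y for independent X ~ p and Y ~ q on the integers."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + sign * y] = out.get(x + sign * y, 0.0) + px * qy
    return out

def ruzsa_divergence(p, q):
    """d_R(X||Y) = h(X - Y) - h(X) for independent X ~ p, Y ~ q."""
    return entropy(combine(p, q, -1)) - entropy(p)

pX = {0: 0.5, 1: 0.5}
pY1 = {0: 0.7, 2: 0.3}
pY2 = {0: 0.4, 1: 0.6}

# First formulation: subadditivity in the second argument.
sub_lhs = ruzsa_divergence(pX, combine(pY1, pY2, +1))
sub_rhs = ruzsa_divergence(pX, pY1) + ruzsa_divergence(pX, pY2)

# Second formulation: adding an independent X2 to X1 cannot increase d_R(. || Y).
mono_lhs = ruzsa_divergence(combine(pX, pY1, +1), pY2)
mono_rhs = ruzsa_divergence(pX, pY2)

print(sub_lhs <= sub_rhs, mono_lhs <= mono_rhs)
```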


Remark 2. Written out in terms of entropies, Theorem 3 is equivalent to the assertion that the entropy of a
sum of independent group-valued random variables is a submodular set function, i.e., $h(X + Y + Z) + h(Z) \le
h(X + Z) + h(Y + Z)$. For discrete entropy, this assertion is implicit in Kaimanovich and Vershik [27], and explicitly
and independently developed in [32], [34]; [34] also contains a generalization from sums to a more general class
of so-called partition-determined functions that can make sense on sets with less algebraic structure. For differential
entropy, this assertion was first presented in [32], and further explored for the case of $\mathbb{R}$-valued random variables
in [30].

If we do not make assumptions about the nature of the underlying distributions, the Ruzsa divergence and 
conditional Ruzsa divergence can be unbounded. In Sections VII and VIII, we will make such assumptions and 
demonstrate a uniform bound on Ruzsa divergence for a log-concave density on $\mathbb{R}^n$. On the other hand, it is possible
to obtain a bound on conditional Ruzsa divergence under mild assumptions on the dependence structure. 

Theorem 4. If $X_1 \leftrightarrow Y \leftrightarrow X_2$ form a Markov chain, then,

$$d_R(X_1\|X_2|Y) \le 2I(X_1; Y) + I(X_2; Y) + d_R(X_1\|Y) + d_R(Y\|X_2).$$

Proof: Let (X\, Y. X 2 ) and (X\. Y", X 2 ) be conditionally independent versions of (X x , Y, X 2 ), given (X x ,X 2 ). 
By the data processing inequality: 

I(X x — X 2 \ X\\Y) 

< I(X l +Y",X 2 + Y"-,X l \Y) 

= h(Xi\Y) + h(X i + Y", X 2 + Y"\Y) 

-h(X 1 + Y",X 2 + Y",X 1 \Y) 

= h{Xi\Y) + h(X 1 + Y ",X 2 + Y"\Y) - h(X 1 ,X 2 ,Y"\Y), 
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where the last equality follows from Lemma 5, and the fact that the linear map (x±,x 2 ,y) (aq + y,x 2 + y, x{) 
has determinant —1. Therefore, 


$$
\begin{aligned}
h(X_1 - X_2 \mid Y) &= h(X_1 - X_2 \mid X_1, Y) + I(X_1 - X_2; X_1 \mid Y) \\
&= h(X_2 \mid Y) + I(X_1 - X_2; X_1 \mid Y) \\
&\le h(X_1, X_2 \mid Y) + h(X_1 + Y'', X_2 + Y'' \mid Y) - h(X_1, X_2, Y'' \mid Y).
\end{aligned}
$$

We have established that, 

$$h(X_1, X_2, Y, Y'') + h(X_1 - X_2, Y) \le h(X_1, X_2, Y) + h(X_1 + Y'', X_2 + Y'', Y). \tag{5}$$

We now deduce the result from (5). First note that, by the conditional independence of $Y$ and $Y''$ given $(X_1, X_2)$, the
first term on the left-hand side of (5) satisfies,

$$h(X_1, X_2, Y, Y'') + h(X_1, X_2) = h(X_1, X_2, Y) + h(X_1, X_2, Y'') = 2h(X_1, X_2, Y),$$

so that, 

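$$
\begin{aligned}
h(X_1 - X_2, Y) &\le h(X_1 + Y'', X_2 + Y'', Y) - h(X_1, X_2, Y) + h(X_1, X_2) \\
&\le \sum_i h(X_i + Y) + h(Y) - h(X_1, X_2, Y) + h(X_1, X_2),
\end{aligned}
$$

where the second inequality uses the subadditivity of joint entropy together with the fact that each pair $(X_i, Y'')$
has the same distribution as $(X_i, Y)$.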


By conditional independence and the chain rule, 

$$
\begin{aligned}
h(X_1, X_2, Y) &= h(X_1, X_2 \mid Y) + h(Y) \\
&= h(X_1 \mid Y) + h(X_2 \mid Y) + h(Y).
\end{aligned}
$$


Thus, 


$$
\begin{aligned}
h(X_1 - X_2 \mid Y) + h(Y) &\le \sum_i h(X_i + Y) + h(Y) + h(X_1, X_2) - [h(X_1 \mid Y) + h(X_2 \mid Y) + h(Y)] \\
&= h(X_1 + Y) - h(X_1 \mid Y) + h(X_2 + Y) - h(X_2 \mid Y) + h(X_1, X_2).
\end{aligned}
$$


So, 


$$h(X_1 - X_2 \mid Y) - h(X_1 \mid Y) \le I(X_1;Y) + d_R(X_1\|Y) + h(X_2 + Y) - h(X_2 \mid Y) + h(X_1, X_2) - h(X_1, Y).$$

Since, 

$$
\begin{aligned}
h(X_2 \mid Y) - h(X_1, X_2) + h(X_1, Y) &= h(X_1, X_2 \mid Y) + h(Y) - h(X_1, X_2) \\
&= h(Y \mid X_1, X_2) \\
&= h(Y \mid X_2) - I(Y; X_1 \mid X_2),
\end{aligned}
$$

and since, 


$$
\begin{aligned}
I(Y; X_1 \mid X_2) &= h(X_1 \mid X_2) - h(X_1 \mid Y, X_2) \\
&= h(X_1 \mid X_2) - h(X_1 \mid Y) \\
&\le h(X_1) - h(X_1 \mid Y) \\
&= I(X_1; Y),
\end{aligned}
$$


we are done.

Let us note two corollaries of Theorem 4. Firstly, if we assume $X_1$, $X_2$ and $Y$ to be independent, we recover the
Ruzsa triangle inequality (Theorem 1). Secondly, the case where the joint distribution is symmetric in $(X_1, X_2)$ is
of interest.

Corollary 2. Suppose $X_1 \leftrightarrow Y \leftrightarrow X_2$ form a Markov chain, and $X_1$ and $X_2$ have the same conditional distribution
given $Y$. Then,

$$d_R(X_1 \| X_2 \mid Y) \le 3I(X;Y) + d_R(X\|Y) + d_R(Y\|X).$$

One may interpret this as follows. For every possible value $y$ of $Y$, consider the Ruzsa divergence between the
conditional distribution of $X$ given $Y = y$, and itself; then the conditional Ruzsa divergence $d_R(X_1 \| X_2 \mid Y)$ is the
average of these quantities under the distribution of $Y$. This follows from the fact that $X_1, X_2$ are conditionally
i.i.d. given $Y$. Thus Corollary 2 says that, for weakly dependent random variables $X, Y$, bounds on the two
(not particularly well behaved) Ruzsa divergences between $X$ and $Y$ yield a bound on this averaged
self-divergence of the conditional distribution of $X$ given $Y$ (which is a well behaved divergence).

Let us recall the Balog-Szemerédi-Gowers theorem, which has become an extremely useful tool in additive
combinatorics in the last two decades. There are several formulations, but the one we focus on is stated in terms
of the restricted sumset $A \overset{E}{+} B$, defined as,

$$A \overset{E}{+} B = \{a + b : a \in A,\ b \in B,\ (a, b) \in E\},$$

where $E$ is some subset of the Cartesian product $A \times B$. If $A$ and $B$ are finite nonempty subsets of an abelian

1 E , - 

group G, and E C A x B satisfies |£j > j^\A\ ■ \B\ and \A + B\ < K^f |H| • \B\ for some K > 1, then there 
exist subsets Aq C A and Bo C B such that |Ho| > |f?o| > ^|B|, and 

\Ao + B 0 \<K 7 ^/\Ao\-\B 0 \. 

The natural probabilistic analogue of a restricted sumset is a sum of dependent random variables. Theorem 4 may be
thought of as an information-theoretic form of the Balog-Szemerédi-Gowers theorem, since bounds for dependent
random vectors are used to deduce bounds for (conditionally) independent random vectors. It is not directly
analogous to the Balog-Szemerédi-Gowers theorem since the bounds are not in terms of the Ruzsa divergences
between $X_1$ and $X_2$, but rather in terms of the Ruzsa divergences between either of them and the auxiliary random
variable $Y$. However, such a direct analogue can be constructed using Theorem 4. This was done in the discrete
case by Tao [54], and in the case of the additive group $\mathbb{R}$ by the authors in [30]. We state below the resulting
theorem in the general setting, using the notation developed in this paper.

Theorem 5. Let $(X_2, Y_1, X_1, Y_2)$ form a Markov chain, with the marginal distributions of the pairs $(X_2, Y_1)$, $(X_1, Y_1)$
and $(X_1, Y_2)$ all being the same as the distribution of $(X, Y)$. Then,

$$d_R(X_2 \| Y_2 \mid X_1, Y_1) + d_R(Y_2 \| X_2 \mid X_1, Y_1) \le 3I(X;Y) + d_R(X\|Y) + d_R(Y\|X).$$

Proof: The proof of [30, Theorem 3.14] for real-valued random variables carries over almost exactly in the 
general case, if one uses Lemma 5 to justify one of the steps. This yields, under the present assumptions, that, 

$$I(X_2 + Y_2; Y_2 \mid X_1, Y_1) + I(X_2 + Y_2; X_2 \mid X_1, Y_1) \le I(X;Y) + I(X+Y;X) + I(X+Y;Y).$$

To obtain the desired result in the stated form, one just needs to replace all occurrences of $Y$, $Y_1$ or $Y_2$ by their
respective inverses (i.e., $-Y$, $-Y_1$ or $-Y_2$), and then make appropriate use of Lemma 1, Lemma 2, and Lemma 5.

IV. Entropies of weighted sums and differences

The Plünnecke inequality in additive combinatorics [43], [46], [47] states that, if $|A + B| \le \alpha|A|$ for finite
nonempty subsets $A, B$ of an abelian group, then for every $k \ge 1$, there exists a nonempty subset $A' \subset A$ such
that

$$|A' + kB| \le \alpha^k |A'|, \tag{6}$$

where $kB$ refers to the sumset $B + \cdots + B$ with $k$ summands. A very elegant and considerably simpler proof,
obtained by Petridis [41], also shows that the same subset $A'$ can be used for all positive integers $k$. The inequality
(6) can be generalized to different summands: if $A$ and $B_1, \ldots, B_m$ are nonempty finite sets, with $|A + B_i| \le \alpha_i|A|$ for
each $i$, then there exists a nonempty subset $A' \subset A$ such that,

$$|A' + B_1 + \cdots + B_m| \le \Big(\prod_{i=1}^{m} \alpha_i\Big) |A'|.$$

This is usually called the Plünnecke-Ruzsa inequality, since it was proved by Ruzsa [46], [47] using an ingenious
combinatorial argument. These inequalities are very influential in additive combinatorics; for example, as expounded
in [55], they are sufficient to obtain Freiman-type inverse theorems for groups with bounded torsion. The analogue
of the Plünnecke-Ruzsa inequality for the entropy is the following subadditivity property of Ruzsa divergence,
which is an immediate consequence of Theorem 3; the same historical remarks made in Remark 2 therefore also
apply here.

Theorem 6. If $X, Y_1, \ldots, Y_k$ are independent, then:

$$d_R\Big(X \,\Big\|\, \sum_{i=1}^{k} Y_i\Big) \le \sum_{i=1}^{k} d_R(X \| Y_i).$$

To see that this is analogous to the Plünnecke-Ruzsa inequality as stated above, we can trivially rewrite it in
the following form: if $d_R(X \| Y_i) \le \alpha_i$ for each $i$, then $d_R(X \| \sum_{i=1}^{k} Y_i) \le \sum_{i=1}^{k} \alpha_i$. Unlike in the case of sets where one
potentially needs to pass to a subset to obtain a valid inequality, the entropy analogue works with the original
random variables of interest.
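
For a concrete feel for Theorem 6, the following sketch (an illustration added here, not taken from the paper) evaluates both sides for independent real Gaussians, for which $d_R(X\|Y) = h(X - Y) - h(X) = \frac12\log(1 + \sigma_Y^2/\sigma_X^2)$. The helper names and the variances are hypothetical choices.

```python
import math

def ruzsa_divergence_gaussian(var_x, var_y):
    """d_R(X||Y) = h(X - Y) - h(X) for independent real Gaussians, in nats."""
    return 0.5 * math.log((var_x + var_y) / var_x)

def subadditivity_sides(var_x, var_ys):
    """Return (lhs, rhs) of Theorem 6 for Gaussian X and Gaussian Y_1, ..., Y_k."""
    lhs = ruzsa_divergence_gaussian(var_x, sum(var_ys))
    rhs = sum(ruzsa_divergence_gaussian(var_x, v) for v in var_ys)
    return lhs, rhs

# Hypothetical example: Var(X) = 2 and three summands Y_i.
lhs, rhs = subadditivity_sides(2.0, [0.5, 1.0, 4.0])
print(f"d_R(X || sum Y_i) = {lhs:.4f}  <=  sum d_R(X || Y_i) = {rhs:.4f}")
```

In this special case the inequality reduces to $1 + \sum_i t_i \le \prod_i (1 + t_i)$ with $t_i = \sigma_{Y_i}^2/\sigma_X^2 \ge 0$.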

The properties of Ruzsa divergence developed in Section III can also be used to understand how the differential 
entropy of the sum of two independent random vectors constrains the differential entropy of their difference. 

Theorem 7. For any $G$-valued random variables $X, Y$,

$$d_R(X \| -Y) \le 2d_R(X\|Y) + d_R(Y\|X).$$

Proof: Let $(X_1, Y_1)$ be independent, with $Z = X_1 - Y_1$. Assume $(X_2, Y_2)$ is conditionally independent of
$(X_1, Y_1)$ given $Z$, and has the same conditional distribution given $Z$ as $(X_1, Y_1)$; thus in particular $Z = X_2 - Y_2$.
Let $(X, Y)$ be independent of $(X_1, Y_1, X_2, Y_2)$, but have the same distribution as either pair $(X_i, Y_i)$.

Since, by construction, $X_1 - Y_1 = X_2 - Y_2 = Z$,

$$
\begin{aligned}
X + Y &= X + Y + (X_2 - Y_2) - (X_1 - Y_1) \\
&= (X - Y_2) - (X_1 - Y) + X_2 + Y_1,
\end{aligned}
$$

and hence, by data processing for mutual information,

$$
\begin{aligned}
I(X; X+Y) &\le I(X; X - Y_2, X_1 - Y, X_2, Y_1) \\
&= h(X - Y_2, X_1 - Y, X_2, Y_1) - h(X - Y_2, X_1 - Y, X_2, Y_1 \mid X) \\
&= h(X - Y_2, X_1 - Y, X_2, Y_1) - h(Z, Y_1, Y_2, Y \mid X),
\end{aligned}
$$

where the last equality follows from the fact that the linear map $(z, y_1, y_2, y, x) \mapsto (x - y_2, y_1 + z - y, y_2 + z, y_1, x)$
has determinant 1.
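
The two determinant claims used alongside Lemma 5 (for the map in the proof of Theorem 4 and for the map just above) can be checked mechanically. The sketch below is an added illustration: it writes each map as an integer coefficient matrix acting on scalar coordinates, which is a simplifying assumption, and `determinant` is a hypothetical helper name.

```python
def determinant(matrix):
    """Integer determinant by Laplace expansion along the first row."""
    size = len(matrix)
    if size == 1:
        return matrix[0][0]
    total = 0
    for col, entry in enumerate(matrix[0]):
        if entry == 0:
            continue
        # Minor: drop the first row and the current column.
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        total += (-1) ** col * entry * determinant(minor)
    return total

# Proof of Theorem 4: (x1, x2, y) -> (x1 + y, x2 + y, x1).
# Columns are ordered (x1, x2, y); each row is one output coordinate.
THEOREM4_MAP = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
]

# Proof of Theorem 7: (z, y1, y2, y, x) -> (x - y2, y1 + z - y, y2 + z, y1, x).
# Columns are ordered (z, y1, y2, y, x).
THEOREM7_MAP = [
    [0, 0, -1, 0, 1],
    [1, 1, 0, -1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
]

print(determinant(THEOREM4_MAP), determinant(THEOREM7_MAP))
```

Only the absolute value of the determinant matters for the entropy identities, so the sign difference between the two maps is immaterial.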


d R [X 
Using the independence of $X$ and $Y$ from each other and from all other random variables for the second term on the
above right-hand side, we have,

$$
\begin{aligned}
d_R(Y \| -X) &\le h(X - Y_2) + h(X_1 - Y) + h(X_2) + h(Y_1) - [h(Z, Y_1, Y_2) + h(Y)] \\
&= [d_R(Y\|X) + h(Y)] + [d_R(X\|Y) + h(X)] + h(X_2) - h(Z, Y_1, Y_2).
\end{aligned} \tag{7}
$$

However, observe that, since $I(Y_1; Y_2 \mid Z) = 0$,

$$
\begin{aligned}
h(Z, Y_1, Y_2) + h(Z) &= h(Y_1, Z) + h(Y_2, Z) \\
&= h(X_1, Y_1) + h(X_2, Y_2) = 2h(X, Y).
\end{aligned} \tag{8}
$$

Plugging (8) into (7) gives, 

$$
\begin{aligned}
d_R(Y \| -X) &\le d_R(Y\|X) + d_R(X\|Y) + h(Y) + 2h(X) - [2h(X, Y) - h(Z)] \\
&= d_R(Y\|X) + d_R(X\|Y) + h(Z) - h(Y) \\
&= 2d_R(Y\|X) + d_R(X\|Y),
\end{aligned}
$$

which is the desired result. ■ 

In the case where $X$ and $Y$ are not just independent but also identically distributed, Theorem 7 simply says that
$d_R(X \| -X) \le 3d_R(X\|X)$, while taking $X$ and $-Y$ to have the same distribution gives,

$$d_R(X\|X) \le d_R(X \| -X) + 2d_R(X \| -X) = 3d_R(X \| -X).$$

In fact, one can obtain tighter bounds in these special cases. 

Corollary 3. If $X, Y$ are i.i.d., then:

$$\frac{d_R(X \| -X)}{d_R(X \| X)} \in \Big[\frac12, 2\Big].$$

Proof: The desired statement is equivalent, for $X_1, X_2$ that are i.i.d., to:

$$\frac12 \le \frac{h(X_1 + X_2) - h(X_1)}{h(X_1 - X_2) - h(X_1)} \le 2. \tag{9}$$

As observed in [30], the upper bound in the inequality (9) follows from Theorem 6, and the lower bound follows 
from Theorem 1, both of which we have already proved for the general setting. ■ 
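
A small worked example (added for illustration, not from the paper) shows where the ratio in Corollary 3 can fall. Take $X, X'$ i.i.d. unit-rate exponential on $\mathbb{R}$. Then $X + X'$ has a Gamma distribution with shape 2, whose differential entropy is $1 + \gamma$ with $\gamma$ the Euler-Mascheroni constant, and $X - X'$ has a standard Laplace distribution, whose differential entropy is $1 + \log 2$; these closed forms are standard facts assumed here rather than derived. The variable names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

# Closed-form differential entropies (in nats) for X, X' i.i.d. Exp(1).
h_x = 1.0                         # h(X)
h_sum = 1.0 + EULER_GAMMA         # h(X + X'), Gamma with shape 2 and scale 1
h_diff = 1.0 + math.log(2.0)      # h(X - X'), Laplace with scale 1

d_sum = h_sum - h_x               # d_R(X || -X) = h(X + X') - h(X)
d_diff = h_diff - h_x             # d_R(X || X)  = h(X - X') - h(X)
ratio = d_sum / d_diff

print(f"d_R(X||-X) = {d_sum:.4f}, d_R(X||X) = {d_diff:.4f}, ratio = {ratio:.4f}")
```

The ratio is roughly 0.83, comfortably inside $[\frac12, 2]$, and it differs from 1 because the exponential distribution is not symmetric.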

Corollary 3 provides inequalities between $h(X + Y)$ and $h(X - Y)$ when $X, Y$ are i.i.d. and $h(X)$ is known.
The requirement to know $h(X)$ to make the comparison cannot be dispensed with in the general setting of locally
compact abelian groups. However, this requirement can be dispensed with for discrete groups: as observed in [1],
$h(X + Y)/h(X - Y)$ must lie between 3/4 and 4/3 if $X$ and $Y$ are i.i.d. random variables in a discrete group.

Finally, let us examine what can be said about weighted sums and differences, i.e., about random variables of
the form $aX + bY$ where $a, b$ are non-zero integers. Discrete entropy inequalities for such random variables play a
key role in the recent work of Wu, Shamai and Verdú [57] on the degrees of freedom of the $M$-user interference
channel. Specifically, they immediately yield inequalities of similar form for the Rényi information dimension of
weighted sums of random variables, which imply, using the single-letter characterization of [57], that for rational
channel coefficients the number of degrees of freedom is strictly smaller than $M/2$. In the following theorem, we
extend all the inequalities proved by [57] for discrete entropy of weighted sums and differences to the general
abelian setting. First we give the generalization of [57, Lemma 18].

Lemma 6. Let $X$, $X'$ and $Z$ be independent $G$-valued random variables, where $X'$ has the same distribution as
$X$. Let $a, b$ be nonzero integers. Then:

$$h(aX + b) \le h((a - b)X + bX' + Z) + d_R(X\|X).$$

Furthermore, if $a$ is even, then:

$$h(aX + b) \le h\Big(\frac{a}{2}X + Z\Big) + h(2X - X') - h(X).$$

Proof: One can simply follow the proof strategy of [57, Lemma 18], which on inspection relies only on the 
subadditivity of Ruzsa divergence and the Ruzsa triangle inequality, both of which we have already proved in the 
general setting. ■ 

Finally we give the generalization of [57, Theorem 14]; its proof is again the same as in the discrete case, using 
the subadditivity of Ruzsa divergence and Ruzsa triangle inequality established earlier. The result of Theorem 8 
can be compared to the inequalities of Bukh [18] for dilated sums of sets. 

Theorem 8. Let $X$ and $Y$ be independent $G$-valued random variables, and $a, b$ be nonnegative integers. Then,

$$h(aX + bY) - h(X + Y) \le r_{a,b}\,\{d_R(X \| -Y) + d_R(Y \| -X)\},$$

where,

$$r_{a,b} = 6\big(\lfloor \log|a| \rfloor + \lfloor \log|b| \rfloor + 2\big).$$

V. Entropies of products and ratios

Since we will need to discuss entropies with respect to two different measures on the same group, we introduce
some additional notation to keep things unambiguous. All the examples considered in this section involve subgroups
$G$ of the group $\mathbb{C}^\times = \mathbb{C} \setminus \{0\}$ equipped with the multiplication operation. The Haar measure for such multiplicative
groups is typically not the same as the familiar Lebesgue measure used to compute differential entropies of real-valued
or complex-valued random variables (the one-dimensional and two-dimensional Lebesgue measures are Haar
measures for the groups $\mathbb{R}$ and $\mathbb{C}$ respectively, but only when the group structure comes from the addition operation).


A. Positive random variables 

Consider the group $\mathbb{R}_{>0} = (0, \infty)$ equipped with the multiplication operation. Its Haar measure is given by,

$$\lambda(dx) = \frac{dx}{x},$$

where $dx$ is Lebesgue measure on $(0, \infty)$. To see this, all we need to do is check the translation-invariance of $\lambda$
with respect to multiplication, i.e., that for any fixed $c > 0$, we have $\lambda(cA) = \lambda(A)$ when

$$\lambda(A) = \int_A \frac{dx}{x},$$

and $dx$ represents Lebesgue measure on $\mathbb{R}$. [And this in turn is an immediate consequence of the fact that the
logarithmic function is an isomorphism between $(\mathbb{R}_{>0}, \times)$ and $(\mathbb{R}, +)$, using the standard translation-invariance of
Lebesgue measure for addition.]

We are interested in two entropies of a positive (i.e., $\mathbb{R}_{>0}$-valued) random variable $X$. To define them, let us
assume that $X$ has a density $f$ with respect to Lebesgue measure on $(0, \infty)$. Then:

1) The differential entropy of $X$ is,

$$h_{\mathbb{R}}(X) = -\int_0^\infty f(x) \log f(x)\,dx.$$

2) The intrinsic entropy $h_\times(X)$ with respect to Haar measure $\lambda$ on $(\mathbb{R}_{>0}, \times)$ is given by,

$$h_\times(X) = -\int_0^\infty [x f(x)] \log[x f(x)]\,\lambda(dx) = h_{\mathbb{R}}(X) - E[\log X], \tag{10}$$

since the density of $X$ with respect to $\lambda$ is $x f(x)$. We use $h_\times$ to emphasize that this is the intrinsic entropy
with respect to the multiplicative structure on $\mathbb{R}_{>0}$ rather than the additive structure on $\mathbb{R}$.

Observe that $\mathbb{R}_{>0}$ is a Polish, locally compact, abelian group to which all of our preceding results apply and
yield statements of interest. For illustration, we only write out one consequence: Corollary 3 says that,

$$\frac12 \le \frac{h_\times(XY) - h_\times(X)}{h_\times(X/Y) - h_\times(X)} \le 2,$$

which, using relation (10), translates to the following statement for the usual differential entropy.

Corollary 4. If $X, Y$ are i.i.d. random variables taking values in $(0, \infty)$, then:

$$h_{\mathbb{R}}(XY) \le 2h_{\mathbb{R}}(X/Y) - h_{\mathbb{R}}(X) + 3E[\log X],$$
$$h_{\mathbb{R}}(X/Y) \le 2h_{\mathbb{R}}(XY) - h_{\mathbb{R}}(X) - 3E[\log X].$$
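
Relation (10) and Corollary 4 can be illustrated with log-normal variables (an example added here, not from the paper). If $X = e^W$ with $W \sim N(\mu, \sigma^2)$, then $h_{\mathbb{R}}(X) = \mu + \frac12\log(2\pi e\sigma^2)$ and $E[\log X] = \mu$, so the intrinsic entropy is just the additive entropy of $\log X$, as the logarithm isomorphism suggests. The function names and parameter values below are hypothetical.

```python
import math

def lognormal_entropy(mu, sigma2):
    """Differential entropy h_R (in nats) of exp(W) with W ~ N(mu, sigma2)."""
    return mu + 0.5 * math.log(2 * math.pi * math.e * sigma2)

def intrinsic_entropy(mu, sigma2):
    """h_x(X) = h_R(X) - E[log X] from relation (10), using E[log X] = mu."""
    return lognormal_entropy(mu, sigma2) - mu

def corollary4_slacks(mu, sigma2):
    """Return (rhs - lhs) for the two inequalities of Corollary 4, X and Y i.i.d."""
    h_x = lognormal_entropy(mu, sigma2)
    h_prod = lognormal_entropy(2 * mu, 2 * sigma2)  # log(XY)  ~ N(2 mu, 2 sigma2)
    h_ratio = lognormal_entropy(0.0, 2 * sigma2)    # log(X/Y) ~ N(0, 2 sigma2)
    slack_prod = 2 * h_ratio - h_x + 3 * mu - h_prod
    slack_ratio = 2 * h_prod - h_x - 3 * mu - h_ratio
    return slack_prod, slack_ratio

print(intrinsic_entropy(1.7, 0.3), corollary4_slacks(1.7, 0.3))
```

For this family both slacks equal $\frac12\log 2$ regardless of $\mu$ and $\sigma^2$, so both inequalities hold with room to spare.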


B. Random variables on the circle group 

Consider the unit circle $\mathbb{T} = \{z \in \mathbb{C} : |z| = 1\}$ in the complex plane; this is of course a group under multiplication,
and is isomorphic to $\mathbb{R}/\mathbb{Z}$ equipped with addition via the isomorphism $t \mapsto e^{2\pi i t}$. Alternatively we can parametrize
$\mathbb{T}$ using the angle $\theta$ subtended by the arc of the circle between the point on $\mathbb{T}$ and the real axis (which is just $2\pi t$).
With this parametrization, the Haar measure $\lambda$ is the uniform distribution on the angle or, equivalently, Lebesgue
measure on $[0, 2\pi)$. For a $\mathbb{T}$-valued random variable $\Theta$ that has a density $f$ with respect to the uniform measure,

$$h(\Theta) = -\int f(x) \log f(x)\,\lambda(dx) = -D(\Theta \| U),$$

where $U \sim \lambda$ is uniformly distributed on $\mathbb{T}$, and $D(\Theta \| U)$ denotes the relative entropy between $\Theta$ and a uniformly distributed
random variable $U$ on $\mathbb{T}$. Thus, the fact that entropy increases on convolution captures in this setting the fact that
convolution brings any distribution closer to the uniform.

In this case, Corollary 3 becomes the following statement. 

Corollary 5. If $\Theta, \Theta'$ are i.i.d. random variables taking values in $\mathbb{T}$, then:

$$\frac12 \le \frac{D(\Theta \| U) - D(\Theta + \Theta' \| U)}{D(\Theta \| U) - D(\Theta - \Theta' \| U)} \le 2.$$
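
To make the circle case tangible, the sketch below (an added illustration under stated assumptions) identifies $\mathbb{T}$ with $\mathbb{R}/\mathbb{Z}$ and takes the hypothetical density $f(t) = 1 + a\cos(2\pi t)$ with respect to the uniform measure. A direct computation shows that both $\Theta + \Theta'$ and $\Theta - \Theta'$ then have density $1 + \frac{a^2}{2}\cos(2\pi t)$, so convolution shrinks the amplitude and hence the relative entropy to the uniform. The relative entropies are approximated by a midpoint rule; the amplitude and grid size are arbitrary choices.

```python
import math

AMPLITUDE = 0.8  # hypothetical; any value in [0, 1] gives a valid density

def relative_entropy_to_uniform(density, grid_size=20000):
    """Midpoint-rule approximation of D(Theta || U) on R/Z, parametrized by [0, 1)."""
    total = 0.0
    for k in range(grid_size):
        value = density((k + 0.5) / grid_size)
        if value > 0.0:
            total += value * math.log(value)
    return total / grid_size

def density(t):
    """Density of Theta with respect to the uniform measure."""
    return 1.0 + AMPLITUDE * math.cos(2 * math.pi * t)

def density_of_sum(t):
    """Density of Theta + Theta'; Theta - Theta' has the same density here."""
    return 1.0 + 0.5 * AMPLITUDE ** 2 * math.cos(2 * math.pi * t)

d_single = relative_entropy_to_uniform(density)
d_sum = relative_entropy_to_uniform(density_of_sum)
print(f"D(Theta||U) = {d_single:.5f}, D(Theta+Theta'||U) = {d_sum:.5f}")
```

Because this density is symmetric, the numerator and denominator in Corollary 5 coincide and the ratio is exactly 1; an asymmetric density would be needed to see the ratio move within $[\frac12, 2]$.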

C. Non-zero complex random variables 

Finally we consider the full group $(\mathbb{C}^\times, \times)$, whose Haar measure is given by,

$$\lambda(dz) = \frac{dz}{|z|^2},$$

where $dz$ is 2-dimensional Lebesgue measure (using the identification of $\mathbb{C}$ with $\mathbb{R}^2$). If $f$ is the density of a
$\mathbb{C}^\times$-valued random variable $Z$ with respect to 2-dimensional Lebesgue measure, one has the intrinsic entropy,

$$h_\times(Z) = -\int_{\mathbb{C}^\times} [|z|^2 f(z)] \log[|z|^2 f(z)]\,\frac{dz}{|z|^2} = h_{\mathbb{R}^2}(Z) - E[\log(|Z|^2)],$$

where we use $h_{\mathbb{R}^2}(Z)$ to denote the usual differential entropy of $Z$.

Then Corollary 3 becomes the following statement. 

Corollary 6. If $Z_1, Z_2$ are i.i.d. random variables taking values in $\mathbb{C}^\times$, then:

$$h_{\mathbb{R}^2}(Z_1 Z_2) \le 2h_{\mathbb{R}^2}(Z_1/Z_2) - h_{\mathbb{R}^2}(Z_1) + 6E[\log|Z_1|],$$
$$h_{\mathbb{R}^2}(Z_1/Z_2) \le 2h_{\mathbb{R}^2}(Z_1 Z_2) - h_{\mathbb{R}^2}(Z_1) - 6E[\log|Z_1|].$$

VI. Freiman-type results for the entropy on $\mathbb{R}^n$

For the rest of the paper, our focus is on the additive group $\mathbb{R}^n$ equipped with Lebesgue measure, so that $h$
denotes the usual differential entropy. Our first observation is a uniform lower bound on the Ruzsa divergence
between a distribution and itself. A simple application of the entropy power inequality [50], [51] to two i.i.d. random
variables easily gives the following result.

Lemma 7. For any $\mathbb{R}^n$-valued random vector $X$ with finite differential entropy,

$$d_R(X\|X) \ge \frac{n}{2}\log 2.$$

Furthermore, $d_R(X \| -X) \ge \frac{n}{2}\log 2$.

The assumption of finite differential entropy in Lemma 7 is in fact essential. As shown by Bobkov and Chistyakov
[12, Proposition 1], there exists an $\mathbb{R}$-valued random variable $X$ of finite entropy such that if $X, X'$ are i.i.d., the
entropy of $X + X'$ does not exist. However, [12] also shows that for any such example, necessarily the entropy of $X$
is $-\infty$, so that it remains true that if the entropy exists and is a real number, then the entropy of the self-convolution
also exists (although, thanks to another example constructed in [12], it may then be $+\infty$!). Henceforth, as stated
in Section II, if nothing is stated, we will assume that all entropies and Ruzsa divergences exist and are finite.

We find it convenient to restate Lemma 7 in terms of the doubling and difference constants associated with a 
random vector. 


Definition 4. For an $\mathbb{R}^n$-valued random vector $X$, the entropy power of $X$ is defined as,

$$\mathcal{N}(X) = \exp\Big\{\frac{2h(X)}{n}\Big\}.$$

For an $\mathbb{R}^n$-valued random vector $X$, the doubling constant is defined by,

$$\sigma_+(X) = \frac{\mathcal{N}(X + X')}{2\mathcal{N}(X)},$$

and the difference constant is defined by,

$$\sigma_-(X) = \frac{\mathcal{N}(X - X')}{2\mathcal{N}(X)},$$

where $X'$ is an independent copy of $X$.


The entropy power inequality immediately implies that if $X$ has finite entropy, then $\sigma_+(X) \ge 1$ and $\sigma_-(X) \ge 1$;
this is just a restatement of Lemma 7 since,

$$\sigma_-(X) = \frac12\exp\Big\{\frac{2}{n}\,d_R(X\|X)\Big\}, \tag{11}$$

and,

$$\sigma_+(X) = \frac12\exp\Big\{\frac{2}{n}\,d_R(X \| -X)\Big\}. \tag{12}$$

Furthermore, because of the equality conditions of the entropy power inequality, $\sigma_+(X)$ (or $\sigma_-(X)$) is equal to
1 if and only if $X$ is a Gaussian (with non-singular covariance matrix). Note that the definitions of doubling and
difference constants of scalar random variables in [54] (for discrete random variables) and in [30] (for $\mathbb{R}$-valued
random variables) used a different normalization, but we have chosen the normalization above so that the minimum
value achieved at Gaussians for both $\sigma_+$ and $\sigma_-$ is 1.
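
As an illustration of Definition 4 (added here, with hypothetical helper names), the doubling constant can be evaluated in closed form for two one-dimensional examples. For a Gaussian with variance $v$, the sum $X + X'$ is Gaussian with variance $2v$. For $X$ uniform on $[0, 1]$, $h(X) = 0$ and $X + X'$ has the triangular density on $[0, 2]$, whose differential entropy is $\frac12$ nat.

```python
import math

def entropy_power(entropy, dimension=1):
    """N(X) = exp(2 h(X) / n), with h measured in nats."""
    return math.exp(2.0 * entropy / dimension)

def doubling_constant(h_x, h_sum, dimension=1):
    """sigma_+(X) = N(X + X') / (2 N(X)), given h(X) and h(X + X')."""
    return entropy_power(h_sum, dimension) / (2.0 * entropy_power(h_x, dimension))

def gaussian_entropy(variance):
    """Differential entropy of a real Gaussian with the given variance."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)

# Gaussian with (hypothetical) variance 3: the sum has variance 6.
sigma_gauss = doubling_constant(gaussian_entropy(3.0), gaussian_entropy(6.0))

# Uniform on [0, 1]: h(X) = 0 and h(X + X') = 1/2.
sigma_uniform = doubling_constant(0.0, 0.5)

print(f"sigma_+ (Gaussian) = {sigma_gauss:.4f}, sigma_+ (uniform) = {sigma_uniform:.4f}")
```

The Gaussian attains the minimum value 1, while the uniform distribution gives $e/2 \approx 1.36$; by symmetry of the uniform distribution about its mean, its difference constant takes the same value.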

A natural question is whether the extremality of Gaussians is a stable phenomenon. In other words, if $\sigma_+(X) \le K$
for some $K$, does this imply that the distribution of $X$ is necessarily not far from being Gaussian, in a sense that can
be quantified in terms of $K$? It is a perhaps somewhat surprising result due to Bobkov, Chistyakov and Götze [14]
that the answer is “no,” even in the one-dimensional setting. Nonetheless, as observed in [30], under the additional
assumption that $X$ has a finite Poincaré constant (and using results independently obtained by Johnson and Barron
[26] and Artstein, Ball, Barthe and Naor [3] on the rate of convergence in the information-theoretic central limit
theorem for $\mathbb{R}$-valued random variables) it can be shown that such a stability bound can indeed be established.





SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY, 2015


This result cannot be directly extended to the case of $\mathbb{R}^n$-valued random vectors, since non-asymptotic bounds that
exhibit convergence rates for the entropic central limit theorem in the multivariate case are not known under just
the assumption of a finite Poincaré constant (see footnote 2). However, by relying on recent work of Ball and Nguyen [7], one
can see that such stability does hold under the stronger assumption of log-concavity.

Recall that a probability density function $f$ defined on $\mathbb{R}^n$ is said to be log-concave if,

$$f(\alpha x + (1 - \alpha)y) \ge f(x)^{\alpha} f(y)^{1-\alpha},$$

for each $x, y \in \mathbb{R}^n$ and each $0 \le \alpha \le 1$. If $f$ is log-concave, we will also use the adjective “log-concave” for a
random variable $X$ distributed according to $f$, and for the probability measure induced by it. Note that the class of
log-concave probability measures is quite broad, including the uniform distribution on any compact, convex set, the
exponential distribution, and of course any Gaussian. On the other hand, log-concavity can also be fairly restrictive:
for instance, it implies at least exponentially decaying tails, and a finite Poincaré constant.

Now we state the main result of [7] we will need. For a random vector $X \sim f$ we write $D(X)$ for its relative
entropy distance from a Gaussian,

$$D(X) = D(f \| f_G) = h(f_G) - h(f),$$

where $f_G$ is the Gaussian density with the same mean and covariance matrix as $f$, and $D$ is the usual relative
entropy.

Theorem 9. [ ] Suppose $X$ is a log-concave random vector in $\mathbb{R}^n$, and that it satisfies a Poincaré inequality with
constant $c$, i.e., for any smooth function $u$ with $E[u(X)] = 0$,

$$c\,E[u(X)^2] \le E[|\nabla u(X)|^2].$$

Then,

$$h\Big(\frac{X_1 + X_2}{\sqrt{2}}\Big) - h(X) \ge \frac{c}{4(1 + c)}\,D(X),$$

where $X_1$ and $X_2$ denote independent copies of $X$.


Simply rearranging the conclusion of Theorem 9 gives the following stability result. 


Corollary 7. If $X$ is a log-concave random vector in $\mathbb{R}^n$, with Poincaré constant $c$, then:

$$\frac{D(X)}{n} \le \frac{2(1 + c)}{c}\log \sigma_+(X).$$


Remark 3. It was proved in [10, Proposition V.6], by relying on an important result of Klartag [29], that log-concave
distributions are not too far from Gaussianity, in the sense that,

$$\frac{D(X)}{n} \le \frac14\log n + C,$$

for some absolute constant $C$. Therefore, the main value of the result in Corollary 7 is in that it explicitly connects
“non-Gaussianity” with the doubling constant $\sigma_+(X)$, and especially when $\sigma_+(X)$ is small.



Interestingly, it is not hard to give a much more elementary converse result when we know something about 
both the doubling and the difference constants. Indeed, we show below that any random vector whose doubling 
and difference constants differ significantly, must also be significantly far from Gaussianity. 

Theorem 10. If $X_1$ and $X_2$ are independent copies of any random vector $X$ in $\mathbb{R}^n$ with finite differential entropy,
then,

$$\frac{D(X)}{n} \ge \frac14\big|\log \sigma_+(X) - \log \sigma_-(X)\big|.$$


Footnote 2: While asymptotic estimates of this sort are known [13], [15], [19], estimates that only hold for a sufficiently
large number of summands are not strong enough for our purposes.

Proof: By the invariance of the entropy under linear transformations of determinant $\pm 1$,

$$
\begin{aligned}
h(X_1) + h(X_2) &= h(X_1, X_2) \\
&= h\Big(\frac{X_1 + X_2}{\sqrt{2}}, \frac{X_1 - X_2}{\sqrt{2}}\Big) \\
&\le h\Big(\frac{X_1 + X_2}{\sqrt{2}}\Big) + h\Big(\frac{X_1 - X_2}{\sqrt{2}}\Big).
\end{aligned}
$$
Let $a$ be the greater of the quantities $h\big(\frac{X_1 + X_2}{\sqrt{2}}\big)$ and $h\big(\frac{X_1 - X_2}{\sqrt{2}}\big)$, and $b$ be the lesser of them. The above display
implies that,

$$\frac{a + b}{2} \ge h(X). \tag{13}$$

Now, by the scaling property of differential entropy, we have,

$$
\frac12\big|h(X_1 - X_2) - h(X_1 + X_2)\big|
= \frac12\Big|h\Big(\frac{X_1 + X_2}{\sqrt{2}}\Big) - h\Big(\frac{X_1 - X_2}{\sqrt{2}}\Big)\Big|
= \frac{a - b}{2} = a - \frac{a + b}{2} \le a - h(X),
$$

using (13) to obtain the inequality. Since both $\frac{X_1 + X_2}{\sqrt{2}}$ and $\frac{X_1 - X_2}{\sqrt{2}}$ have the same covariance matrix as $X$, the
maximum entropy property of the Gaussian implies that $h(Z) \ge a$, where $Z$ is a Gaussian random vector with the
same covariance matrix as $X$. Thus we have,

$$\frac12\big|h(X_1 - X_2) - h(X_1 + X_2)\big| \le h(Z) - h(X) = D(X),$$

which is equivalent to the desired statement by using the relations (11) and (12).
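
The bound of Theorem 10 can be checked numerically on an asymmetric example (an illustration added here, not from the paper). For $X$ unit-rate exponential on $\mathbb{R}$ we use the standard closed-form entropies $h(X) = 1$, $h(X + X') = 1 + \gamma$ (Gamma with shape 2) and $h(X - X') = 1 + \log 2$ (standard Laplace), together with $\mathrm{Var}(X) = 1$; these facts are assumed rather than derived, and the variable names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

h_x = 1.0                        # h(X) for Exp(1)
h_sum = 1.0 + EULER_GAMMA        # h(X + X')
h_diff = 1.0 + math.log(2.0)     # h(X - X')

# D(X): entropy gap to the Gaussian with the same variance (Var X = 1, n = 1).
gaussian_gap = 0.5 * math.log(2 * math.pi * math.e) - h_x

# log sigma_+ and log sigma_- from identities (12) and (11) with n = 1.
log_sigma_plus = 2.0 * (h_sum - h_x) - math.log(2.0)
log_sigma_minus = 2.0 * (h_diff - h_x) - math.log(2.0)
lower_bound = 0.25 * abs(log_sigma_plus - log_sigma_minus)

print(f"D(X) = {gaussian_gap:.4f} >= {lower_bound:.4f}")
```

Here $D(X) \approx 0.419$ while the lower bound is about $0.058$, so the inequality holds but is far from tight for this distribution.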
VII. An explicit reverse entropy power inequality

In recent work [9], a reverse entropy power inequality was developed for the class of log-concave distributions.
Recall that the entropy power inequality due to Shannon and Stam [50], [51] asserts that $\mathcal{N}(X + Y) \ge \mathcal{N}(X) +
\mathcal{N}(Y)$, for any two independent random vectors $X$ and $Y$ in $\mathbb{R}^n$ for which the entropy is defined. The entropy
power inequality may be formally strengthened by using the invariance of entropy under affine transformations of
determinant $\pm 1$, i.e., $\mathcal{N}(u(X)) = \mathcal{N}(X)$ whenever $|\det(u)| = 1$. Specifically,

$$\inf_{u_1, u_2} \mathcal{N}(u_1(X) + u_2(Y)) \ge \mathcal{N}(X) + \mathcal{N}(Y), \tag{14}$$

where the maps $u_i : \mathbb{R}^n \to \mathbb{R}^n$ range over all affine entropy-preserving transformations. What [9] showed was
that the inequality (14) can be reversed with a constant independent of dimension if we restrict to log-concave
distributions.

Theorem 11 (REVERSE EPI, [9]). If $X$ and $Y$ are independent random vectors in $\mathbb{R}^n$ with log-concave densities,
there exist linear entropy-preserving maps $u_i : \mathbb{R}^n \to \mathbb{R}^n$ such that

$$\mathcal{N}(\tilde{X} + \tilde{Y}) \le C\,(\mathcal{N}(X) + \mathcal{N}(Y)), \tag{15}$$

where $\tilde{X} = u_1(X)$, $\tilde{Y} = u_2(Y)$, and where $C$ is a universal constant.

This reverse entropy power inequality is analogous to Milman’s [37] reverse Brunn-Minkowski inequality (see 
also [38], [39], [42]), which is a celebrated result in convex geometry. In this light, Theorem 11 can be seen as 
an extension of the analogies between geometry and information theory that were previously observed by Dembo, 
Cover and Thomas [23], among others. Also, Theorem 11 can be extended to the larger subclass of so-called 
“convex measures” [11]. 

Observe that the universal constant provided by the proof of Theorem 11 is not explicit, and it is not easy to 
even get bounds on it. But in the special case when X and Y have the same distribution, we show below that an 
explicit constant can be obtained rather simply. To do this, we first note that the following result of Cover and 
Zhang [21] easily generalizes to higher dimensions: If $X$ and $X'$ are (possibly dependent) random variables with
the same log-concave marginal distribution on $\mathbb{R}$, then $h(X + X') \le h(2X)$.

Theorem 12. If $X$ and $Y$ are (possibly dependent) random vectors in $\mathbb{R}^n$, with the same log-concave marginal
density, then,

$$h(X + Y) \le h(2X).$$

Proof: Suppose the common marginal density of $X$ and $Y$ is $f$, and let $g$ be the density of $Z = X + Y$. Since
$f$ is log-concave, Jensen's inequality implies that,

$$
\begin{aligned}
E\log f\Big(\frac{X + Y}{2}\Big) &\ge E\,\tfrac12\big[\log f(X) + \log f(Y)\big] \\
&= \tfrac12\big[E\log f(X) + E\log f(Y)\big] \\
&= -h(f).
\end{aligned}
$$

Observe that independence is not required here, and all expectations are taken with respect to the joint distribution
of $(X, Y)$. In particular, we have that,

$$
\begin{aligned}
\int g(z)\log \tilde{f}(z)\,dz &= \int g(z)\log f\Big(\frac{z}{2}\Big)\,dz - n\log 2 \\
&\ge -h(f) - n\log 2 \\
&= -h(\tilde{f}),
\end{aligned}
$$

where $\tilde{f}(z) = 2^{-n} f(z/2)$ is the density of $Z^* = 2X$. In other words,

$$D(g \| \tilde{f}) + h(g) \le h(\tilde{f}).$$

Thus $h(g) = h(X + Y)$ is maximized if and only if $g = \tilde{f}$, i.e., when $X$ and $Y$ are identical. ■

Theorem 12 immediately implies that for i.i.d. random vectors with log-concave distribution, the reverse entropy
power inequality (Theorem 11) holds with both linear transformations being the identity, and with a universal
constant of 2.

Corollary 8. If $X, X'$ are independent random vectors with the same log-concave distribution, then,

$$\mathcal{N}(X + X') \le 2[\mathcal{N}(X) + \mathcal{N}(X')].$$

In other words, for any log-concave random vector $X$, $\sigma_+(X) \le 2$.

Proof: From Theorem 12,

$$\mathcal{N}(X + X') \le \mathcal{N}(2X) = 4\mathcal{N}(X) = 2[\mathcal{N}(X) + \mathcal{N}(X)].$$


A version of Corollary 8 was obtained (contemporaneously with this work) by a different method in [16]; however
the bound on the doubling constant in that work is $e^4/2 \approx 27.3$, which is significantly worse than the bound of 2
we obtain. Soon after the first version of this paper was released, some related results and a nice conjecture about
reverse forms of the entropy power inequality were released by Ball, Nayar and Tkocz [6].

Although it already seems rather restrictive that the doubling constant of any log-concave random vector lies 
between 1 and 2, we do not believe the upper bound is optimal. However, Corollary 8 represents yet another way 
in which general log-concave random vectors resemble Gaussian ones; as mentioned in Remark 3, [10] gives a 
different formulation of this intuition. 

Another way to view Corollary 8 is in the context of the central limit theorem. Recall that the central limit
theorem in terms of relative entropy ([8], [4], see also [33]) asserts that if $X, X_1, X_2, \ldots$ are i.i.d. random vectors
with $h(X) > -\infty$, then, as $n \to \infty$,


Corollary 8 implies that,

$$\mathcal{N}\Big(\frac{X_1 + X_2}{\sqrt{2}}\Big) \le 2\mathcal{N}(X),$$


and hence constrains the rate at which entropy can increase when doubling sample size in the central limit theorem 
for i.i.d. log-concave summands. 

The above development is also closely related to a very nice observation of K. Ball, dating back to around 2003 but
with details only being published much later in [7], relating two important conjectures in convex geometry, namely
the Kannan-Lovász-Simonovits conjecture [28] and the hyperplane conjecture or slicing problem of Bourgain [17].
We explain this connection in our language; the reasoning is related to that of K. Ball even if it differs in details.
The Kannan-Lovász-Simonovits (KLS) conjecture asserts that the Poincaré constant $c$ is bounded from below for all
log-concave densities by a universal constant $C$ independent of dimension. If this is true, then Corollary 7 implies
that

$$\frac{D(X)}{n} \le \frac{2(1 + c)}{c}\log \sigma_+(X) \le \frac{2(1 + C)}{C}\log 2,$$

using Corollary 8 for the second inequality. In other words, $D(X)/n$ is bounded by a universal constant for any
log-concave random vector $X$ in $\mathbb{R}^n$, which by [10, Corollary 5.3], is equivalent to the hyperplane conjecture (whose
original formulation in [17] we do not bother to state here). Hence the KLS conjecture implies the hyperplane
conjecture.

VIII. Towards a Rogers-Shephard inequality for entropy
The Rogers-Shephard inequality [44] asserts that, if $K \subset \mathbb{R}^n$ is a convex body, then

$$\mathrm{Vol}(K - K) \le \binom{2n}{n}\,\mathrm{Vol}(K), \tag{16}$$

with equality if and only if $K$ is the $n$-dimensional simplex. It complements the fact, implied by the Brunn-Minkowski
inequality, that,

$$\mathrm{Vol}(K - K) \ge 2^n\,\mathrm{Vol}(K). \tag{17}$$

Indeed, since by Stirling's formula and some algebraic manipulation,

$$\binom{2n}{n} < 4^n,$$

the inequalities (16) and (17) together imply,

$$2\,\mathrm{Vol}(K)^{1/n} \le \mathrm{Vol}(K - K)^{1/n} \le 4\,\mathrm{Vol}(K)^{1/n}.$$
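
The two volume inequalities (16) and (17) can be seen in action in the plane. The following sketch (an added illustration with hypothetical helper names) computes the area of the difference body $K - K$ of a convex polygon as the convex hull of all pairwise vertex differences, using a standard monotone-chain hull and the shoelace formula. A triangle, the 2-dimensional simplex, should attain the Rogers-Shephard constant $\binom{4}{2} = 6$, and a centrally symmetric square should attain the Brunn-Minkowski constant $2^2 = 4$.

```python
import math

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula for a simple polygon given by its ordered vertices."""
    total = 0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

def difference_body_area(vertices):
    """Area of K - K for a convex polygon K with the given vertices."""
    diffs = [(a[0] - b[0], a[1] - b[1]) for a in vertices for b in vertices]
    return polygon_area(convex_hull(diffs))

TRIANGLE = [(0, 0), (1, 0), (0, 1)]
SQUARE = [(0, 0), (1, 0), (1, 1), (0, 1)]

for name, body in [("triangle", TRIANGLE), ("square", SQUARE)]:
    print(name, difference_body_area(body) / polygon_area(body))
```

The elementary bound $\binom{2n}{n} < 4^n$ can likewise be confirmed for small $n$ with `math.comb`.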


As suggested by the analogy between the reverse entropy power inequality and the reverse Brunn-Minkowski
inequality discussed in the preceding section, the natural probabilistic analogue of a convex set is a log-concave
distribution, and a natural probabilistic analogue of volume is entropy. Therefore, it is natural to ask whether there
is a probabilistic analogue of the Rogers-Shephard inequality. Indeed, we show that for $X, X'$ i.i.d. log-concave
random vectors, $\mathcal{N}(X - X')$ is bounded by a multiple of $\mathcal{N}(X)$.

Corollary 9. If $X, X'$ are independent random vectors drawn from the same log-concave distribution, then

$$\mathcal{N}(X - X') \le 16\,\mathcal{N}(X).$$

In other words, for any log-concave random vector $X$, $\sigma_-(X) \le 8$.


Proof: By Corollary 8,

$$\mathcal{N}(X + X') \le 4\,\mathcal{N}(X),$$

and by Corollary 3,

$$\mathcal{N}(X - X') \le \frac{\mathcal{N}(X + X')^2}{\mathcal{N}(X)} \le 16\,\mathcal{N}(X).$$


Corollary 9 does not provide a tight bound. Indeed, in the contemporaneous work [16], a different approach is
used to obtain a bound on the difference constant of $e^2/2 \approx 3.7$, which is better than our bound of 8. We state
below a conjecture for the sharp constant in the one-dimensional case.

Conjecture 1. If $X, X'$ are independent $\mathbb{R}$-valued random variables drawn from the same log-concave distribution,
then,

$$\mathcal{N}(X - X') \le 4\,\mathcal{N}(X),$$

with equality if and only if $X$ is a translated and scaled version of the (one-sided) exponential distribution. In
other words, for any log-concave random variable $X$, $\sigma_-(X) \le 2$.
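
The role of the exponential distribution in Conjecture 1 can be seen from a direct computation (an added illustration, not from the paper). For $X$ unit-rate exponential, the standard closed forms $h(X) = 1$, $h(X - X') = 1 + \log 2$ (Laplace) and $h(X + X') = 1 + \gamma$ (Gamma with shape 2) are assumed; `sigma` is a hypothetical helper implementing identities (11) and (12) with $n = 1$.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def sigma(h_x, h_combined):
    """(1/2) exp(2 [h_combined - h(X)]): identities (11) and (12) with n = 1."""
    return 0.5 * math.exp(2.0 * (h_combined - h_x))

H_EXP = 1.0                                       # h(X) for Exp(1)
exp_sigma_plus = sigma(H_EXP, 1.0 + EULER_GAMMA)      # uses h(X + X')
exp_sigma_minus = sigma(H_EXP, 1.0 + math.log(2.0))   # uses h(X - X')

print(f"sigma_+ = {exp_sigma_plus:.4f}, sigma_- = {exp_sigma_minus:.4f}")
```

The difference constant equals 2, that is $\mathcal{N}(X - X') = 4\mathcal{N}(X)$, which is the conjectured equality case, while the doubling constant is about 1.59 and respects the bound $\sigma_+(X) \le 2$ of Corollary 8.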

Of course, we may also write Corollary 9 and Corollary 8 in terms of the Ruzsa divergence using the identities 
(11) and (12). 

Corollary 10. If X is a log-concave random vector taking values in R n , then, 

dR(X\\X) < 2nlog2 and c(r(X|| — X) < nlog2. 

Let us note that a sharp functional analogue of the Rogers-Shephard inequality has been proved by Colesanti 
[20] for log-concave functions as opposed to densities (see also [5], [2]). 

IX. Determinant inequalities

Differential entropy inequalities have been used to deduce inequalities for positive-definite matrices since
Cover and El Gamal’s work in [22]; see also [23] and [36]. However, in most of these cases, the inequalities 
deduced relate determinants of a positive-definite matrix to those of its square submatrices. We discuss below the 
use of differential entropy inequalities to prove determinantal inequalities for sums of positive-definite matrices. 
As in the above papers, the main idea is to use the fact that, for the Gaussian distribution on $\mathbb{R}^n$ with covariance
matrix $K$, written $\gamma_K = N(0, K)$, the differential entropy is given by,

$$h(\gamma_K) = \frac{1}{2}\log\left[(2\pi e)^n \det(K)\right].$$


A classical inequality for the determinant of sums is Minkowski’s inequality, which asserts that, for $n \times n$
positive-definite matrices,

$$\det(A + B)^{1/n} \ge \det(A)^{1/n} + \det(B)^{1/n}.$$


This may be seen as a consequence of the entropy power inequality (by specializing to Gaussians), but there are 
also elementary means of deriving it. 
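Minkowski's inequality can be spot-checked on random positive-definite matrices. The sketch below is illustrative only: the helpers `det`, `random_pd` and `minkowski_gap` are hypothetical names, and the test matrices $G G^T + 0.1\,I$ are an arbitrary choice of generator, not something taken from the paper.

```python
import random

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign of the determinant
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def random_pd(n, rng):
    """Random positive-definite matrix G G^T + 0.1 I (assumed test generator)."""
    g = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    return [[sum(g[i][k] * g[j][k] for k in range(n)) + (0.1 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def minkowski_gap(a, b):
    """det(A+B)^(1/n) - det(A)^(1/n) - det(B)^(1/n); nonnegative if the inequality holds."""
    n = len(a)
    s = [[a[i][j] + b[i][j] for j in range(n)] for i in range(n)]
    return det(s) ** (1 / n) - det(a) ** (1 / n) - det(b) ** (1 / n)

rng = random.Random(0)
gaps = [minkowski_gap(random_pd(n, rng), random_pd(n, rng))
        for n in (1, 2, 3, 5) for _ in range(50)]
print(min(gaps))  # nonnegative up to floating-point rounding
```

The gap vanishes when one matrix is a positive multiple of the other, which mirrors the Gaussian equality case of the entropy power inequality.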

On the other hand, upper bounds for the determinant of a sum of positive-definite matrices are not as well 
known. This is partly due to the fact that the most straightforward inequalities that one might try to check, like 
subadditivity, are actually false. However, Rotfel’d [45] did obtain such a bound when one of the matrices involved 
is the identity matrix: 

$$\det(I + A + B) \le \det(I + A) \cdot \det(I + B). \tag{18}$$

Indeed, he obtained this as a special case of a more general inequality for arbitrary square matrices,

$$\det(I + |A + B|) \le \det(I + |A|) \cdot \det(I + |B|),$$

where $|A| = \sqrt{A^* A}$ and $A^*$ is the adjoint of $A$.
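A small worked example makes both points concrete. It assumes that subadditivity is read as $\det(A + B) \le \det(A) + \det(B)$, and the two $2 \times 2$ matrices below are arbitrary illustrative choices: this form of subadditivity fails for them, while inequality (18) holds.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def add(*ms):
    """Entrywise sum of 2x2 matrices."""
    return [[sum(m[i][j] for m in ms) for j in range(2)] for i in range(2)]

I2 = [[1.0, 0.0], [0.0, 1.0]]
A = [[2.0, 1.0], [1.0, 1.0]]   # positive-definite: trace 3, determinant 1
B = [[1.0, 0.0], [0.0, 3.0]]   # positive-definite diagonal matrix

# Subadditivity, read here as det(A + B) <= det(A) + det(B), fails for this pair.
lhs_sub, rhs_sub = det2(add(A, B)), det2(A) + det2(B)

# Inequality (18) holds for the same pair.
lhs_18, rhs_18 = det2(add(I2, A, B)), det2(add(I2, A)) * det2(add(I2, B))

print(lhs_sub, rhs_sub, lhs_18, rhs_18)  # 11.0 4.0 19.0 40.0
```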

Our final observation is that substituting normals in Theorem 6 provides an extremely simple alternative proof
of a generalization of inequality (18), not requiring any of the matrices to be the identity:

Corollary 11. Let $K$ and $K_j$ be positive-definite matrices of the same dimension. Then:

$$\det(K + K_1 + \cdots + K_n) \le [\det(K)]^{-(n-1)} \prod_{j=1}^{n} \det(K + K_j).$$
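Corollary 11 can be checked numerically in logarithmic form, which is also the form in which it arises from Gaussian entropies. The following sketch makes several assumptions: `logdet_pd`, `random_pd`, `mat_sum` and `corollary11_gap` are hypothetical helper names, the random matrices are an arbitrary test generator, and `d` denotes the matrix dimension while `n` is the number of matrices $K_j$, as in the statement.

```python
import math
import random

def logdet_pd(m):
    """log det of a positive-definite matrix via its Cholesky factor L (m = L L^T)."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(n))

def random_pd(d, rng):
    """Random d x d positive-definite matrix G G^T + 0.1 I (assumed test generator)."""
    g = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
    return [[sum(g[i][k] * g[j][k] for k in range(d)) + (0.1 if i == j else 0.0)
             for j in range(d)] for i in range(d)]

def mat_sum(ms):
    """Entrywise sum of a list of square matrices of equal size."""
    d = len(ms[0])
    return [[sum(m[i][j] for m in ms) for j in range(d)] for i in range(d)]

def corollary11_gap(k, ks):
    """log(right side) - log(left side) of Corollary 11; nonnegative if it holds."""
    n = len(ks)
    lhs = logdet_pd(mat_sum([k] + ks))
    rhs = sum(logdet_pd(mat_sum([k, kj])) for kj in ks) - (n - 1) * logdet_pd(k)
    return rhs - lhs

rng = random.Random(1)
gaps = []
for d in (1, 2, 4):
    for n in (1, 2, 3, 5):
        for _ in range(20):
            k = random_pd(d, rng)
            ks = [random_pd(d, rng) for _ in range(n)]
            gaps.append(corollary11_gap(k, ks))
print(min(gaps), max(gaps))
```

Taking `k` to be the identity and `n = 2` recovers inequality (18), and for `n = 1` the two sides coincide.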